Problem Statement¶

Business Context:¶

AlphaCom, a leading telecommunications provider, has recently experienced a concerning rise in customer churn despite offering competitive services and a wide product portfolio. This increase is directly impacting revenue and undermining brand reputation in an intensely competitive market. Traditional retention strategies have proven inadequate because customer churn is influenced by a complex mix of factors, including service usage, billing preferences, contract types, and demographics. Without clear insights into these patterns, the company is left reacting to churn instead of preventing it.

Objective¶

As a data scientist at AlphaCom, you are tasked with developing a predictive model to identify customers at high risk of churn and uncover the key factors driving their decisions. Solving this problem will enable the company to proactively design targeted retention strategies, reduce churn-related losses, and improve customer lifetime value, ultimately safeguarding revenue and strengthening AlphaCom’s competitive position.

Data Description¶

The data contains different attributes related to churn. The detailed data dictionary is given below:

  • Gender: The customer’s gender (e.g., Male or Female). This demographic feature may correlate with customer behavior.
  • SeniorCitizen: A binary indicator (if included) that identifies whether the customer is a senior citizen (commonly 1 for senior, 0 for non-senior). Senior status can influence service preferences and retention strategies.
  • Partner: Indicates whether the customer has a partner. This factor can affect customer loyalty and service usage patterns.
  • Dependents: Specifies whether the customer has dependents. This information can provide context on the customer’s household and influence their service needs.
  • Tenure: The number of months the customer has been with the company. Longer tenure may indicate higher loyalty, while shorter tenure could be a churn risk indicator.
  • PhoneService: Denotes whether the customer subscribes to telephone services. This binary feature (Yes/No) helps understand service adoption.
  • MultipleLines: Indicates if the customer has multiple phone lines. This feature can provide insight into customer behavior and service complexity.
  • InternetService: Describes the type of internet service the customer uses (e.g., DSL, Fiber optic, or None). The type of internet service can be a critical factor in churn analysis.
  • OnlineSecurity: Shows whether the customer subscribes to online security services. This value (Yes/No) may influence customer satisfaction and retention.
  • OnlineBackup: Indicates if the customer has an online backup service. Similar to online security, this can be a part of the overall service bundle affecting churn.
  • DeviceProtection: Specifies whether the customer is enrolled in a device protection plan, providing an added layer of service value.
  • TechSupport: Denotes if the customer subscribes to technical support services. Access to tech support can improve customer experience and reduce churn.
  • StreamingTV: Indicates whether the customer subscribes to a streaming TV service. Media consumption patterns can be a differentiator in customer preferences.
  • StreamingMovies: Specifies if the customer subscribes to a streaming movies service. This, combined with other services, can highlight trends in customer behavior.
  • Contract: Describes the type of contract the customer holds (e.g., month-to-month, one-year, or two-year). Contract type is a strong indicator of churn risk—shorter contracts are often associated with higher churn.
  • PaperlessBilling: Indicates whether the customer is enrolled in paperless billing. This operational feature can sometimes correlate with customer engagement levels.
  • PaymentMethod: Details the payment method used by the customer (e.g., electronic check, mailed check, bank transfer, or credit card). Payment methods can affect both churn and overall customer satisfaction.
  • MonthlyCharges: The monthly amount in USD charged to the customer. Higher charges might increase the likelihood of churn if customers perceive the cost as too high for the value provided.
  • TotalCharges: The cumulative amount in USD charged over the customer’s tenure. This helps in understanding the long-term value of each customer and can be a predictor of churn.
  • Churn: The target variable indicating whether the customer has left (typically denoted as “Yes” or “No”). This is the primary outcome you aim to predict with your machine learning model.

Importing necessary libraries¶

In [1]:
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
import math
import re
from scipy.stats import loguniform

# To suppress scientific notations
pd.set_option("display.float_format", lambda x: "%.3f" % x)

# Libaries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# To tune model, get different metric scores, and split data
from sklearn import metrics
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
)
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.impute import KNNImputer
from sklearn.preprocessing import RobustScaler

# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder

# To impute missing values
from sklearn.impute import SimpleImputer

# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# To do hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import recall_score, precision_score, f1_score, accuracy_score, confusion_matrix

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)

# To supress scientific notations for a dataframe
pd.set_option("display.float_format", lambda x: "%.3f" % x)

# To help with model building
import statsmodels.api as statmod
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    BaggingClassifier,
    StackingClassifier
)
from xgboost import XGBClassifier
from xgboost import XGBRegressor
from sklearn.svm import SVC

from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    precision_recall_curve,
    roc_curve,
)

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

# To supress warnings
import warnings
warnings.filterwarnings("ignore")
print('Lib. Import Done')
Lib. Import Done

Loading the dataset¶

In [2]:
import os
path = os.getcwd() + '\\customer_churn.csv'
data = pd.read_csv(path)
In [3]:
# copying data to another variable to avoid any changes to original data
df_data = data.copy()

Data Overview¶

Checking the shape of the dataset¶

In [4]:
# Checking the number of rows and columns in the training data
df_data.shape
Out[4]:
(12055, 20)

Displaying the first few rows of the dataset¶

In [5]:
# let's view the first 5 rows of the data
df_data.head().T
Out[5]:
0 1 2 3 4
gender Female Male Male Male Female
SeniorCitizen 0 0 0 0 0
Partner Yes No No No No
Dependents No No No No No
tenure 1.000 34.000 2.000 45.000 2.000
PhoneService No Yes Yes No Yes
MultipleLines No phone service No No No phone service No
InternetService DSL DSL DSL DSL Fiber optic
OnlineSecurity No Yes Yes Yes No
OnlineBackup Yes No Yes No No
DeviceProtection No Yes No Yes No
TechSupport No No No Yes No
StreamingTV No No No No No
StreamingMovies No No No No No
Contract Month-to-month One year Month-to-month One year Month-to-month
PaperlessBilling Yes No Yes No Yes
PaymentMethod Electronic check Mailed Check Mailed check bank transfer (automatic) ELECTRONIC CHECK
MonthlyCharges $29.85 $56.95 $53.85 $42.3 $70.7
TotalCharges $29.85 $1889.5 $108.15 $1840.75 $nan
Churn No NO YES No yes

Checking the data types of the columns for the dataset & fixing anomalies¶

In [6]:
# let's check the data types of the columns in the dataset
df_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12055 entries, 0 to 12054
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            12055 non-null  object 
 1   SeniorCitizen     12055 non-null  int64  
 2   Partner           12055 non-null  object 
 3   Dependents        12055 non-null  object 
 4   tenure            11451 non-null  float64
 5   PhoneService      12055 non-null  object 
 6   MultipleLines     12055 non-null  object 
 7   InternetService   12055 non-null  object 
 8   OnlineSecurity    12055 non-null  object 
 9   OnlineBackup      12055 non-null  object 
 10  DeviceProtection  12055 non-null  object 
 11  TechSupport       12055 non-null  object 
 12  StreamingTV       12055 non-null  object 
 13  StreamingMovies   12055 non-null  object 
 14  Contract          12055 non-null  object 
 15  PaperlessBilling  12055 non-null  object 
 16  PaymentMethod     12055 non-null  object 
 17  MonthlyCharges    12055 non-null  object 
 18  TotalCharges      12055 non-null  object 
 19  Churn             12055 non-null  object 
dtypes: float64(1), int64(1), object(18)
memory usage: 1.8+ MB
In [7]:
# Monthly & Total Charges should be numeric. Lets clean up by removing currencies & then convert to numeric
df_data[["MonthlyCharges", "TotalCharges"]] = (
    df_data[["MonthlyCharges", "TotalCharges"]]
    .replace({'\$|£|£': ''}, regex=True)
    .apply(pd.to_numeric, errors='coerce')
)

# For consistency, lets convert SeniorCitizen to Yes/No (object datatype) from 1/0 (int datatype)
df_data["SeniorCitizen"] = df_data["SeniorCitizen"].map({1: "Yes", 0: "No"}).astype("object")

df_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12055 entries, 0 to 12054
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            12055 non-null  object 
 1   SeniorCitizen     12055 non-null  object 
 2   Partner           12055 non-null  object 
 3   Dependents        12055 non-null  object 
 4   tenure            11451 non-null  float64
 5   PhoneService      12055 non-null  object 
 6   MultipleLines     12055 non-null  object 
 7   InternetService   12055 non-null  object 
 8   OnlineSecurity    12055 non-null  object 
 9   OnlineBackup      12055 non-null  object 
 10  DeviceProtection  12055 non-null  object 
 11  TechSupport       12055 non-null  object 
 12  StreamingTV       12055 non-null  object 
 13  StreamingMovies   12055 non-null  object 
 14  Contract          12055 non-null  object 
 15  PaperlessBilling  12055 non-null  object 
 16  PaymentMethod     12055 non-null  object 
 17  MonthlyCharges    11754 non-null  float64
 18  TotalCharges      10850 non-null  float64
 19  Churn             12055 non-null  object 
dtypes: float64(3), object(17)
memory usage: 1.8+ MB
In [8]:
# Check Value_counts of all object columns
object_cols = df_data.select_dtypes(include='object').columns.tolist()

# Show value counts for each categorical column
for col in object_cols:
    print(f"\n=== Value counts for '{col}' ===")
    counts = df_data[col].value_counts(dropna=False)
    percentages = (counts / len(df_data) * 100).round(2)
    result = pd.DataFrame({'Count': counts, 'Percentage': percentages})
    print(result)
=== Value counts for 'gender' ===
        Count  Percentage
gender                   
Male     6710      55.660
Female   5345      44.340

=== Value counts for 'SeniorCitizen' ===
               Count  Percentage
SeniorCitizen                   
No             10633      88.200
Yes             1422      11.800

=== Value counts for 'Partner' ===
         Count  Percentage
Partner                   
No        6989      57.980
Yes       5066      42.020

=== Value counts for 'Dependents' ===
            Count  Percentage
Dependents                   
No           8846      73.380
Yes          3209      26.620

=== Value counts for 'PhoneService' ===
              Count  Percentage
PhoneService                   
Yes           10747      89.150
No             1308      10.850

=== Value counts for 'MultipleLines' ===
                  Count  Percentage
MultipleLines                      
Yes                5609      46.530
No                 5157      42.780
No phone service   1289      10.690

=== Value counts for 'InternetService' ===
                 Count  Percentage
InternetService                   
Fiber optic       4878      40.460
DSL               4350      36.080
No                2827      23.450

=== Value counts for 'OnlineSecurity' ===
                     Count  Percentage
OnlineSecurity                        
No                    6312      52.360
Yes                   2924      24.260
No internet service   2819      23.380

=== Value counts for 'OnlineBackup' ===
                     Count  Percentage
OnlineBackup                          
No                    5982      49.620
Yes                   3271      27.130
No internet service   2802      23.240

=== Value counts for 'DeviceProtection' ===
                     Count  Percentage
DeviceProtection                      
Yes                   4634      38.440
No                    4592      38.090
No internet service   2829      23.470

=== Value counts for 'TechSupport' ===
                     Count  Percentage
TechSupport                           
No                    6222      51.610
Yes                   3000      24.890
No internet service   2833      23.500

=== Value counts for 'StreamingTV' ===
                     Count  Percentage
StreamingTV                           
No                    5001      41.480
Yes                   4235      35.130
No internet service   2819      23.380

=== Value counts for 'StreamingMovies' ===
                     Count  Percentage
StreamingMovies                       
No                    5058      41.960
Yes                   4174      34.620
No internet service   2823      23.420

=== Value counts for 'Contract' ===
                Count  Percentage
Contract                         
Month-to-month   6554      54.370
Two year         2945      24.430
One year         2556      21.200

=== Value counts for 'PaperlessBilling' ===
                  Count  Percentage
PaperlessBilling                   
Yes                6160      51.100
No                 5895      48.900

=== Value counts for 'PaymentMethod' ===
                               Count  Percentage
PaymentMethod                                   
electronic check                 860       7.130
ELECTRONIC CHECK                 841       6.980
Electronic check                 830       6.890
Electronic Check                 816       6.770
  Electronic check               798       6.620
  Credit card (automatic)        593       4.920
Credit Card (Automatic)          589       4.890
Credit card (automatic)          586       4.860
CREDIT CARD (AUTOMATIC)          582       4.830
credit card (automatic)          580       4.810
Mailed check                     548       4.550
  Mailed check                   519       4.310
mailed check                     512       4.250
Mailed Check                     504       4.180
MAILED CHECK                     502       4.160
BANK TRANSFER (AUTOMATIC)        502       4.160
bank transfer (automatic)        494       4.100
  Bank transfer (automatic)      484       4.010
Bank Transfer (Automatic)        467       3.870
Bank transfer (automatic)        448       3.720

=== Value counts for 'Churn' ===
         Count  Percentage
Churn                     
No        3473      28.810
  No      1762      14.620
NO        1732      14.370
no        1683      13.960
Yes       1356      11.250
yes        714       5.920
YES        684       5.670
  Yes      651       5.400
In [9]:
# Fixing Data in 'Multple lines'
# Replace Yes --> Mutiple Phone Lines; No --> Single Phone Line; No phone service remains as is
df_data["MultipleLines"] = df_data["MultipleLines"].replace({
        "Yes": "Multiple phone lines",
        "No": "Single phone line" })

df_data["MultipleLines"].value_counts(dropna=False)
Out[9]:
MultipleLines
Multiple phone lines    5609
Single phone line       5157
No phone service        1289
Name: count, dtype: int64
In [10]:
# Analyze Phone Service & Multiple Lines together for any inconsistency
pd.crosstab(df_data["PhoneService"], df_data["MultipleLines"])
Out[10]:
MultipleLines Multiple phone lines No phone service Single phone line
PhoneService
No 68 1137 103
Yes 5541 152 5054
In [11]:
# Fixing Inconsistency in MultipleLines
df_data.loc[df_data["PhoneService"] == "No", "MultipleLines"] = "No phone service"
df_data.loc[
    (df_data["PhoneService"] == "Yes") & (df_data["MultipleLines"] == "No phone service"),
    "MultipleLines"
] = "Single phone line"
pd.crosstab(df_data["PhoneService"], df_data["MultipleLines"])
Out[11]:
MultipleLines Multiple phone lines No phone service Single phone line
PhoneService
No 0 1308 0
Yes 5541 0 5206
In [12]:
# Merge PhoneService & MultipleLines into one combined column
df_data["PhoneServiceStatus"] = np.where(df_data["PhoneService"] == "No", "No phone service", df_data["MultipleLines"])
df_data.drop(columns=["PhoneService", "MultipleLines"], inplace=True)
df_data["PhoneServiceStatus"].value_counts(dropna=False)
Out[12]:
PhoneServiceStatus
Multiple phone lines    5541
Single phone line       5206
No phone service        1308
Name: count, dtype: int64
In [13]:
df_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12055 entries, 0 to 12054
Data columns (total 19 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   gender              12055 non-null  object 
 1   SeniorCitizen       12055 non-null  object 
 2   Partner             12055 non-null  object 
 3   Dependents          12055 non-null  object 
 4   tenure              11451 non-null  float64
 5   InternetService     12055 non-null  object 
 6   OnlineSecurity      12055 non-null  object 
 7   OnlineBackup        12055 non-null  object 
 8   DeviceProtection    12055 non-null  object 
 9   TechSupport         12055 non-null  object 
 10  StreamingTV         12055 non-null  object 
 11  StreamingMovies     12055 non-null  object 
 12  Contract            12055 non-null  object 
 13  PaperlessBilling    12055 non-null  object 
 14  PaymentMethod       12055 non-null  object 
 15  MonthlyCharges      11754 non-null  float64
 16  TotalCharges        10850 non-null  float64
 17  Churn               12055 non-null  object 
 18  PhoneServiceStatus  12055 non-null  object 
dtypes: float64(3), object(16)
memory usage: 1.7+ MB
In [14]:
#lets do similar data inconsistency checks on internet services
internet_services = [
    "OnlineSecurity", "OnlineBackup", "DeviceProtection",
    "TechSupport", "StreamingTV", "StreamingMovies"
]

for col in internet_services:
    print(f"\n=== Crosstab: InternetService vs {col} ===")
    print(pd.crosstab(df_data["InternetService"], df_data[col]))
=== Crosstab: InternetService vs OnlineSecurity ===
OnlineSecurity     No  No internet service   Yes
InternetService                                 
DSL              2555                   19  1776
Fiber optic      3741                    7  1130
No                 16                 2793    18

=== Crosstab: InternetService vs OnlineBackup ===
OnlineBackup       No  No internet service   Yes
InternetService                                 
DSL              2852                   12  1486
Fiber optic      3089                    7  1782
No                 41                 2783     3

=== Crosstab: InternetService vs DeviceProtection ===
DeviceProtection    No  No internet service   Yes
InternetService                                  
DSL               1936                   24  2390
Fiber optic       2643                    9  2226
No                  13                 2796    18

=== Crosstab: InternetService vs TechSupport ===
TechSupport        No  No internet service   Yes
InternetService                                 
DSL              2520                   25  1805
Fiber optic      3679                   11  1188
No                 23                 2797     7

=== Crosstab: InternetService vs StreamingTV ===
StreamingTV        No  No internet service   Yes
InternetService                                 
DSL              2706                   17  1627
Fiber optic      2276                    6  2596
No                 19                 2796    12

=== Crosstab: InternetService vs StreamingMovies ===
StreamingMovies    No  No internet service   Yes
InternetService                                 
DSL              2668                   16  1666
Fiber optic      2368                    7  2503
No                 22                 2800     5
In [15]:
# Fixing Inconsistency in Internet Services Cols
for c in internet_services:
    df_data[c] = np.where(df_data["InternetService"] == "No","No internet service",np.where(df_data[c] == "No internet service","No",df_data[c]))

for col in internet_services:
    print(f"\n=== Crosstab: InternetService vs {col} ===")
    print(pd.crosstab(df_data["InternetService"], df_data[col]))
=== Crosstab: InternetService vs OnlineSecurity ===
OnlineSecurity     No  No internet service   Yes
InternetService                                 
DSL              2574                    0  1776
Fiber optic      3748                    0  1130
No                  0                 2827     0

=== Crosstab: InternetService vs OnlineBackup ===
OnlineBackup       No  No internet service   Yes
InternetService                                 
DSL              2864                    0  1486
Fiber optic      3096                    0  1782
No                  0                 2827     0

=== Crosstab: InternetService vs DeviceProtection ===
DeviceProtection    No  No internet service   Yes
InternetService                                  
DSL               1960                    0  2390
Fiber optic       2652                    0  2226
No                   0                 2827     0

=== Crosstab: InternetService vs TechSupport ===
TechSupport        No  No internet service   Yes
InternetService                                 
DSL              2545                    0  1805
Fiber optic      3690                    0  1188
No                  0                 2827     0

=== Crosstab: InternetService vs StreamingTV ===
StreamingTV        No  No internet service   Yes
InternetService                                 
DSL              2723                    0  1627
Fiber optic      2282                    0  2596
No                  0                 2827     0

=== Crosstab: InternetService vs StreamingMovies ===
StreamingMovies    No  No internet service   Yes
InternetService                                 
DSL              2684                    0  1666
Fiber optic      2375                    0  2503
No                  0                 2827     0
In [16]:
#Lets create another columns - Internet_AddOnCount
df_data["Internet_AddOnCount"] = (df_data[internet_services] == "Yes").sum(axis=1)
In [17]:
# Fix PaymentMethod & Churn columns with standard values

# Clean 'PaymentMethod' values
df_data["PaymentMethod"] = (
    df_data["PaymentMethod"]
    .astype(str)             # ensure string type
    .str.strip()             # remove leading/trailing spaces
    .str.lower()             # convert to lowercase
    .str.replace(r'\s+', ' ', regex=True)  # normalize multiple spaces
)
payment_map = {
    "electronic check": "Electronic check",
    "credit card (automatic)": "Credit card (automatic)",
    "mailed check": "Mailed check",
    "bank transfer (automatic)": "Bank transfer (automatic)"
}

df_data["PaymentMethod"] = df_data["PaymentMethod"].map(payment_map)
df_data["PaymentMethod"].value_counts(dropna=False)
Out[17]:
PaymentMethod
Electronic check             4145
Credit card (automatic)      2930
Mailed check                 2585
Bank transfer (automatic)    2395
Name: count, dtype: int64
In [18]:
# Clean up the 'Churn' column
df_data["Churn"] = (
    df_data["Churn"]
    .astype(str)                # ensure text
    .str.strip()                # remove leading/trailing spaces
    .str.lower()                # standardize case
    .replace({'yes': 'Yes', 'no': 'No'})  # map to consistent labels
)
df_data["Churn"].value_counts(dropna=False)
Out[18]:
Churn
No     8650
Yes    3405
Name: count, dtype: int64
In [19]:
#check for negative values & replace with NAN which are logocally inconsistent in our use case
cols = ["tenure", "MonthlyCharges", "TotalCharges"]
neg_counts = (df_data[cols] < 0).sum()
print('Negative Value Check:-')
print(neg_counts)
#replace negative values with NAN
df_data[cols] = df_data[cols].mask(df_data[cols] < 0, np.nan)

neg_counts_post = (df_data[cols] < 0).sum()
print('Negative Value Check Post Treatment:-')
print(neg_counts_post)
Negative Value Check:-
tenure            127
MonthlyCharges      0
TotalCharges      147
dtype: int64
Negative Value Check Post Treatment:-
tenure            0
MonthlyCharges    0
TotalCharges      0
dtype: int64
In [20]:
#Create a new columns IsNewCustomer for all customers with tenure = 0
df_data["IsNewCustomer"] = np.where(df_data["tenure"] == 0, "Yes", "No")
df_data["IsNewCustomer"].value_counts()
Out[20]:
IsNewCustomer
No     11755
Yes      300
Name: count, dtype: int64

Checking for duplicate values¶

In [21]:
# let's check for duplicate values in the data
print('Duplicated Values: ',df_data.duplicated().sum())
Duplicated Values:  27
In [22]:
#Drop duplicate rows
df_data = df_data.drop_duplicates()
df_data = df_data.reset_index(drop=True)
print('Duplicated Values Post-dropping duplicates: ',df_data.duplicated().sum())
Duplicated Values Post-dropping duplicates:  0

Checking for missing values¶

In [23]:
# let's check for missing values in the data
print('Missing Values:-\n')
df_data.isnull().sum()
Missing Values:-

Out[23]:
gender                    0
SeniorCitizen             0
Partner                   0
Dependents                0
tenure                  728
InternetService           0
OnlineSecurity            0
OnlineBackup              0
DeviceProtection          0
TechSupport               0
StreamingTV               0
StreamingMovies           0
Contract                  0
PaperlessBilling          0
PaymentMethod             0
MonthlyCharges          301
TotalCharges           1351
Churn                     0
PhoneServiceStatus        0
Internet_AddOnCount       0
IsNewCustomer             0
dtype: int64
In [24]:
# Impute Missing Values using KNN Imputer (please note we will have to use RobustScaler first before using KNNImputer)
#Define columns
num_cols_knn = ["tenure", "MonthlyCharges", "TotalCharges"]  # columns to impute
cat_cols_knn = ["Contract", "InternetService", "PaymentMethod", "SeniorCitizen", "Dependents"]  # context columns

#One-hot encode categorical columns
df_knn_encoded = pd.get_dummies(df_data[cat_cols_knn], drop_first=False)

#Scale numeric columns using RobustScaler
scaler = RobustScaler()
df_knn_scaled = df_data.copy()
df_knn_scaled[num_cols_knn] = scaler.fit_transform(df_knn_scaled[num_cols_knn])

#Combine numeric + encoded categorical columns
X_knn = pd.concat([df_knn_scaled[num_cols_knn], df_knn_encoded], axis=1)

#Initialize and apply KNN imputer
imputer = KNNImputer(n_neighbors=5, weights='uniform')
X_imputed = imputer.fit_transform(X_knn)

#Convert imputed data back to DataFrame
X_imputed = pd.DataFrame(X_imputed, columns=X_knn.columns, index=df_data.index)

#Reverse scaling to restore original values
X_imputed[num_cols_knn] = scaler.inverse_transform(X_imputed[num_cols_knn])

#Replace missing values in original DataFrame with imputed ones (rounded)
for col in num_cols_knn:
    df_data[col] = np.round(X_imputed[col], 2)

#Verify
print(df_data.isna().sum())
gender                 0
SeniorCitizen          0
Partner                0
Dependents             0
tenure                 0
InternetService        0
OnlineSecurity         0
OnlineBackup           0
DeviceProtection       0
TechSupport            0
StreamingTV            0
StreamingMovies        0
Contract               0
PaperlessBilling       0
PaymentMethod          0
MonthlyCharges         0
TotalCharges           0
Churn                  0
PhoneServiceStatus     0
Internet_AddOnCount    0
IsNewCustomer          0
dtype: int64
In [25]:
# Monthly Charges & Total Charges, together would lead to multicollinearity.
# Create 3 additional features to derive info from Total charges & then drop it

df_data["AvgMonthlySpend"] = (df_data["TotalCharges"] / df_data["tenure"].replace(0, np.nan))
df_data["BillingRatio"] = (df_data["TotalCharges"] / (df_data["MonthlyCharges"] * df_data["tenure"].replace(0, np.nan)))
df_data["RelativeSpend"] = (df_data["MonthlyCharges"] / df_data["AvgMonthlySpend"].replace(0, np.nan))

# Drop TotalCharges
df_data.drop(columns=["TotalCharges"], inplace=True)

# These ratios will generate NAN since we have tenure values as zero in denomenator. Let check:-
print(df_data.isna().sum())
gender                   0
SeniorCitizen            0
Partner                  0
Dependents               0
tenure                   0
InternetService          0
OnlineSecurity           0
OnlineBackup             0
DeviceProtection         0
TechSupport              0
StreamingTV              0
StreamingMovies          0
Contract                 0
PaperlessBilling         0
PaymentMethod            0
MonthlyCharges           0
Churn                    0
PhoneServiceStatus       0
Internet_AddOnCount      0
IsNewCustomer            0
AvgMonthlySpend        300
BillingRatio           300
RelativeSpend          302
dtype: int64
In [26]:
# These are new customers with tenure = 0 (or Total Charges = 0). Lets replace them with 0. 
# Mathematically, a new customer has no billing or spend yet.
# For logistic regression, we need numeric values (no NaN).
# New customers will have values = 0 for these ratios while Older customers retain their real derived values.
# The IsNewCustomer flag allows the model to learn that “0” = new user.

cols_ratios = ["AvgMonthlySpend", "BillingRatio", "RelativeSpend"]
df_data[cols_ratios] = df_data[cols_ratios].fillna(0)

print(df_data.isna().sum())
gender                 0
SeniorCitizen          0
Partner                0
Dependents             0
tenure                 0
InternetService        0
OnlineSecurity         0
OnlineBackup           0
DeviceProtection       0
TechSupport            0
StreamingTV            0
StreamingMovies        0
Contract               0
PaperlessBilling       0
PaymentMethod          0
MonthlyCharges         0
Churn                  0
PhoneServiceStatus     0
Internet_AddOnCount    0
IsNewCustomer          0
AvgMonthlySpend        0
BillingRatio           0
RelativeSpend          0
dtype: int64
In [27]:
# tenure is one of the strongest churn predictors — but its effect is non-linear. Binning helps capture those non-linear lifecycle effects explicitly
df_data["TenureGroup"] = pd.cut(
    df_data["tenure"],
    bins=[-1, 6, 12, 24, 48, 80],
    labels=["0–6m", "7–12m", "13–24m", "25–48m", "49m+"]
)

# Drop tenure
df_data.drop(columns=["tenure"], inplace=True)
In [28]:
# ContractPaymentCombo captures interaction between billing preference and commitment length
df_data["ContractPaymentCombo"] = df_data["Contract"] + "_" + df_data["PaymentMethod"]
In [29]:
# Add another feature Cost Deviation - Price sensitivity by service type
# Customers paying more than average for their (contract + service type) combo may churn from perceived unfair pricing.

df_data["ServiceAvgCharge"] = df_data.groupby(["Contract", "InternetService"])["MonthlyCharges"].transform("mean")
df_data["CostDeviation"] = df_data["MonthlyCharges"] - df_data["ServiceAvgCharge"]
df_data.drop(columns=["ServiceAvgCharge"], inplace=True)
In [30]:
#Checking datatypes again
# Convert all object columns to category dtype
df_data = df_data.astype({col: "category" for col in df_data.select_dtypes(include="object").columns})
df_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12028 entries, 0 to 12027
Data columns (total 25 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   gender                12028 non-null  category
 1   SeniorCitizen         12028 non-null  category
 2   Partner               12028 non-null  category
 3   Dependents            12028 non-null  category
 4   InternetService       12028 non-null  category
 5   OnlineSecurity        12028 non-null  category
 6   OnlineBackup          12028 non-null  category
 7   DeviceProtection      12028 non-null  category
 8   TechSupport           12028 non-null  category
 9   StreamingTV           12028 non-null  category
 10  StreamingMovies       12028 non-null  category
 11  Contract              12028 non-null  category
 12  PaperlessBilling      12028 non-null  category
 13  PaymentMethod         12028 non-null  category
 14  MonthlyCharges        12028 non-null  float64 
 15  Churn                 12028 non-null  category
 16  PhoneServiceStatus    12028 non-null  category
 17  Internet_AddOnCount   12028 non-null  int64   
 18  IsNewCustomer         12028 non-null  category
 19  AvgMonthlySpend       12028 non-null  float64 
 20  BillingRatio          12028 non-null  float64 
 21  RelativeSpend         12028 non-null  float64 
 22  TenureGroup           12028 non-null  category
 23  ContractPaymentCombo  12028 non-null  category
 24  CostDeviation         12028 non-null  float64 
dtypes: category(19), float64(5), int64(1)
memory usage: 789.9 KB

Statistical summary of the dataset¶

In [31]:
# let's view the statistical summary of the numerical columns in the data
df_data.describe(include='all').T
Out[31]:
count unique top freq mean std min 25% 50% 75% max
gender 12028 2 Male 6695 NaN NaN NaN NaN NaN NaN NaN
SeniorCitizen 12028 2 No 10608 NaN NaN NaN NaN NaN NaN NaN
Partner 12028 2 No 6970 NaN NaN NaN NaN NaN NaN NaN
Dependents 12028 2 No 8822 NaN NaN NaN NaN NaN NaN NaN
InternetService 12028 3 Fiber optic 4866 NaN NaN NaN NaN NaN NaN NaN
OnlineSecurity 12028 3 No 6309 NaN NaN NaN NaN NaN NaN NaN
OnlineBackup 12028 3 No 5948 NaN NaN NaN NaN NaN NaN NaN
DeviceProtection 12028 3 Yes 4609 NaN NaN NaN NaN NaN NaN NaN
TechSupport 12028 3 No 6221 NaN NaN NaN NaN NaN NaN NaN
StreamingTV 12028 3 No 4996 NaN NaN NaN NaN NaN NaN NaN
StreamingMovies 12028 3 No 5049 NaN NaN NaN NaN NaN NaN NaN
Contract 12028 3 Month-to-month 6533 NaN NaN NaN NaN NaN NaN NaN
PaperlessBilling 12028 2 Yes 6144 NaN NaN NaN NaN NaN NaN NaN
PaymentMethod 12028 4 Electronic check 4136 NaN NaN NaN NaN NaN NaN NaN
MonthlyCharges 12028.000 NaN NaN NaN 64.374 30.235 15.290 30.825 71.300 89.300 121.670
Churn 12028 2 No 8634 NaN NaN NaN NaN NaN NaN NaN
PhoneServiceStatus 12028 3 Multiple phone lines 5536 NaN NaN NaN NaN NaN NaN NaN
Internet_AddOnCount 12028.000 NaN NaN NaN 1.841 1.796 0.000 0.000 1.000 3.000 6.000
IsNewCustomer 12028 2 No 11728 NaN NaN NaN NaN NaN NaN NaN
AvgMonthlySpend 12028.000 NaN NaN NaN 109.382 249.554 0.000 29.039 71.222 99.441 6183.710
BillingRatio 12028.000 NaN NaN NaN 1.985 4.977 0.000 0.937 1.000 1.104 131.543
RelativeSpend 12028.000 NaN NaN NaN 1.654 17.737 0.000 0.844 0.997 1.050 1819.333
TenureGroup 12028 5 49m+ 3578 NaN NaN NaN NaN NaN NaN NaN
ContractPaymentCombo 12028 12 Month-to-month_Electronic check 3215 NaN NaN NaN NaN NaN NaN NaN
CostDeviation 12028.000 NaN NaN NaN 0.000 13.678 -79.520 -6.751 -0.901 7.391 85.294
In [32]:
# Exporting clean version 
df_data.to_csv("cleaned_customer_churn.csv", index=False,encoding="utf-8-sig")

Exploratory Data Analysis¶

The below functions need to be defined to carry out the Exploratory Data Analysis.¶

In [33]:
    # function to plot a boxplot and a histogram along the same scale.
    
    
    def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
        """
        Boxplot and histogram combined
    
        data: dataframe
        feature: dataframe column
        figsize: size of figure (default (12,7))
        kde: whether to the show density curve (default False)
        bins: number of bins for histogram (default None)
        """
        f2, (ax_box2, ax_hist2) = plt.subplots(
            nrows=2,  # Number of rows of the subplot grid= 2
            sharex=True,  # x-axis will be shared among all subplots
            gridspec_kw={"height_ratios": (0.25, 0.75)},
            figsize=figsize,
        )  # creating the 2 subplots
        sns.boxplot(
            data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
        )  # boxplot will be created and a triangle will indicate the mean value of the column
        sns.histplot(
            data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
        ) if bins else sns.histplot(
            data=data, x=feature, kde=kde, ax=ax_hist2
        )  # For histogram
        ax_hist2.axvline(
            data[feature].mean(), color="green", linestyle="--"
        )  # Add mean to the histogram
        ax_hist2.axvline(
            data[feature].median(), color="black", linestyle="-"
        )  # Add median to the histogram
        f2.suptitle(f"Histogram Boxplot: {feature}", fontsize=12, fontweight="bold")

    # Tight layout to adjust spacing
    plt.tight_layout()
    plt.show();
<Figure size 640x480 with 0 Axes>
In [34]:
# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Spectral",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage
    
        plt.title(f"Labeled Barplot: {feature}", fontsize=12, fontweight="bold")
    
    plt.show()  # show the plot
In [35]:
# function to plot stacked bar chart

def stacked_barplot(data, predictor, target):
    """
    Print category counts and plot a stacked bar chart with dynamic title and % y-axis.

    Parameters
    ----------
    data : DataFrame
        Input dataset
    predictor : str
        Independent variable (categorical)
    target : str
        Target variable (categorical)
    """

    # --- Crosstabs for counts and proportions ---
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]

    # Display counts table
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)

    # Proportions table for plotting
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )

    # --- Plot ---
    colors = ["#1f77b4", "#d62728"]   # blue, red (classic contrast)
    ax = tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5), color=colors)

    # Dynamic chart title
    plt.title(f"Stacked Barplot: {predictor} vs {target}", fontsize=12, fontweight="bold")

    # Format y-axis as percentage
    ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda y, _: "{:.0f}%".format(y * 100)))

    plt.ylabel("Percentage", fontsize=12)
    plt.xlabel(predictor, fontsize=12)
    plt.xticks(rotation=45, ha='right')

    # Adjust legend position and layout
    plt.legend(
        title=target,
        loc="upper left",
        bbox_to_anchor=(1, 1),
        frameon=False
    )

    plt.tight_layout()
    plt.show()
In [36]:
### Function to plot distributions

def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
    )

    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )
   
    plt.show()

Univariate analysis¶

In [37]:
# Separate categorical and numerical columns
cat_cols = df_data.select_dtypes(include=['object', 'category']).columns.tolist()
num_cols = df_data.select_dtypes(include=['int64', 'float64']).columns.tolist()
print('Category Columns:',cat_cols)
print('Numerical Columns',num_cols)
Category Columns: ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'Churn', 'PhoneServiceStatus', 'IsNewCustomer', 'TenureGroup', 'ContractPaymentCombo']
Numerical Columns ['MonthlyCharges', 'Internet_AddOnCount', 'AvgMonthlySpend', 'BillingRatio', 'RelativeSpend', 'CostDeviation']

Univariate Analysis - Categorical¶

In [38]:
for i in cat_cols:
 labeled_barplot(data=df_data, feature=i, perc=True, n=None)
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Univariate Analysis - Numerical¶

In [39]:
for i in num_cols:
    histogram_boxplot(data=df_data, feature=i, figsize=(12, 7), kde=False, bins=None)
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Bivariate Distributions¶

In [40]:
for i in cat_cols:
    stacked_barplot(df_data, i, "Churn")
    print('************************************************************************************************************************')
Churn     No   Yes    All
gender                   
All     8634  3394  12028
Male    4802  1893   6695
Female  3832  1501   5333
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
************************************************************************************************************************
Churn            No   Yes    All
SeniorCitizen                   
All            8634  3394  12028
No             7816  2792  10608
Yes             818   602   1420
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
************************************************************************************************************************
Churn      No   Yes    All
Partner                   
All      8634  3394  12028
No       4634  2336   6970
Yes      4000  1058   5058
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
************************************************************************************************************************
Churn         No   Yes    All
Dependents                   
All         8634  3394  12028
No          5944  2878   8822
Yes         2690   516   3206
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
************************************************************************************************************************
Churn              No   Yes    All
InternetService                   
All              8634  3394  12028
Fiber optic      2682  2184   4866
DSL              3400   945   4345
No               2552   265   2817
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
************************************************************************************************************************
Churn                  No   Yes    All
OnlineSecurity                        
All                  8634  3394  12028
No                   3598  2711   6309
Yes                  2484   418   2902
No internet service  2552   265   2817
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
************************************************************************************************************************
Churn                  No   Yes    All
OnlineBackup                          
All                  8634  3394  12028
No                   3544  2404   5948
Yes                  2538   725   3263
No internet service  2552   265   2817
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
************************************************************************************************************************
Churn                  No   Yes    All
DeviceProtection                      
All                  8634  3394  12028
No                   2586  2016   4602
Yes                  3496  1113   4609
No internet service  2552   265   2817
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
************************************************************************************************************************
Churn                  No   Yes    All
TechSupport                           
All                  8634  3394  12028
No                   3477  2744   6221
Yes                  2605   385   2990
No internet service  2552   265   2817
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
************************************************************************************************************************
Churn                  No   Yes    All
StreamingTV                           
All                  8634  3394  12028
No                   3003  1993   4996
Yes                  3079  1136   4215
No internet service  2552   265   2817
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
************************************************************************************************************************
Churn                  No   Yes    All
StreamingMovies                       
All                  8634  3394  12028
No                   3001  2048   5049
Yes                  3081  1081   4162
No internet service  2552   265   2817
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
************************************************************************************************************************
Churn             No   Yes    All
Contract                         
All             8634  3394  12028
Month-to-month  3527  3006   6533
One year        2265   290   2555
Two year        2842    98   2940
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
************************************************************************************************************************
Churn               No   Yes    All
PaperlessBilling                   
All               8634  3394  12028
Yes               3929  2215   6144
No                4705  1179   5884
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
************************************************************************************************************************
Churn                        No   Yes    All
PaymentMethod                               
All                        8634  3394  12028
Electronic check           2106  2030   4136
Mailed check               2002   570   2572
Credit card (automatic)    2516   412   2928
Bank transfer (automatic)  2010   382   2392
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
************************************************************************************************************************
Churn    No   Yes    All
Churn                   
Yes       0  3394   3394
All    8634  3394  12028
No     8634     0   8634
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
************************************************************************************************************************
Churn                   No   Yes    All
PhoneServiceStatus                     
All                   8634  3394  12028
Multiple phone lines  3852  1684   5536
Single phone line     3803  1383   5186
No phone service       979   327   1306
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
************************************************************************************************************************
Churn            No   Yes    All
IsNewCustomer                   
All            8634  3394  12028
No             8494  3234  11728
Yes             140   160    300
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
************************************************************************************************************************
Churn          No   Yes    All
TenureGroup                   
All          8634  3394  12028
0–6m         1401  1517   2918
25–48m       2237   614   2851
13–24m       1107   528   1635
7–12m         647   399   1046
49m+         3242   336   3578
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
************************************************************************************************************************
Churn                                       No   Yes    All
ContractPaymentCombo                                       
All                                       8634  3394  12028
Month-to-month_Electronic check           1326  1889   3215
Month-to-month_Mailed check                977   509   1486
Month-to-month_Credit card (automatic)     654   313    967
Month-to-month_Bank transfer (automatic)   570   295    865
One year_Electronic check                  513   120    633
One year_Credit card (automatic)           747    71    818
One year_Bank transfer (automatic)         545    55    600
One year_Mailed check                      460    44    504
Two year_Bank transfer (automatic)         895    32    927
Two year_Credit card (automatic)          1115    28   1143
Two year_Electronic check                  267    21    288
Two year_Mailed check                      565    17    582
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
************************************************************************************************************************
In [41]:
distribution_plot_wrt_target(df_data, predictor='MonthlyCharges', target='Churn')
No description has been provided for this image
In [42]:
distribution_plot_wrt_target(df_data, predictor='Internet_AddOnCount', target='Churn')
No description has been provided for this image
In [43]:
distribution_plot_wrt_target(df_data, predictor='AvgMonthlySpend', target='Churn')
No description has been provided for this image
In [44]:
distribution_plot_wrt_target(df_data, predictor='BillingRatio', target='Churn')
No description has been provided for this image
In [45]:
distribution_plot_wrt_target(df_data, predictor='RelativeSpend', target='Churn')
No description has been provided for this image
In [46]:
distribution_plot_wrt_target(df_data, predictor='CostDeviation', target='Churn')
No description has been provided for this image

Multivariate Distributions¶

In [47]:
plt.figure(figsize=(12, 7))
sns.heatmap(df_data[num_cols].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
No description has been provided for this image
In [48]:
# Plot the Pairplot
sns.pairplot(df_data[num_cols], diag_kind="kde")
plt.show()
No description has been provided for this image

Data Preprocessing¶

Outlier Detection¶

In [49]:
# outlier detection using boxplot
plt.figure(figsize=(15, 12))

for i, variable in enumerate(num_cols):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(df_data[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()

# MonthlyCharges, Internet_AddOnCount - no outliers & in limited range | no treatment required
# AvgMonthlySpend - Extremely high values may be the case of high churn outlier Treatment may lose important information. There is no data entry or noise | no treatment recommended
# BillingRatio, RelativeSpend, CostDeviation - Derived ratios | They represent real, valuable customer segments (like premium or high-risk users). Removing them would reduce your model’s ability to learn high-value churn behavior
No description has been provided for this image

Train-Test-Val Split¶

In [50]:
# creating the copy of the dataframe
df = df_data.copy()
print('Shape of Cleaned dataset:',df.shape)
Shape of Cleaned dataset: (12028, 25)
In [51]:
# reordering Columns - Churn made as last column & reset index
if 'Churn' in df.columns:
    churn = df.pop('Churn')
    df['Churn'] = churn
df.reset_index(drop=True, inplace=True)
df.head()
Out[51]:
gender SeniorCitizen Partner Dependents InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges PhoneServiceStatus Internet_AddOnCount IsNewCustomer AvgMonthlySpend BillingRatio RelativeSpend TenureGroup ContractPaymentCombo CostDeviation Churn
0 Female No Yes No DSL No Yes No No No No Month-to-month Yes Electronic check 29.850 No phone service 1 No 29.850 1.000 1.000 0–6m Month-to-month_Electronic check -26.548 No
1 Male No No No DSL Yes No Yes No No No One year No Mailed check 56.950 Single phone line 2 No 55.574 0.976 1.025 25–48m One year_Mailed check -10.352 No
2 Male No No No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.850 Single phone line 2 No 54.075 1.004 0.996 0–6m Month-to-month_Mailed check -2.548 Yes
3 Male No No No DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.300 No phone service 3 No 40.906 0.967 1.034 25–48m One year_Bank transfer (automatic) -25.002 No
4 Female No No No Fiber optic No No No No No No Month-to-month Yes Electronic check 70.700 Single phone line 0 No 628.935 8.896 0.112 0–6m Month-to-month_Electronic check -15.651 Yes
In [52]:
# Dividing train data into X and y

X = df.drop(["Churn"], axis=1)
y = df["Churn"]
In [53]:
# Splitting data into training and validation set:

X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

# then we split the temporary set into train and validation

X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print('Train dataset shape: ',X_train.shape,'\nValidation dataset shape: ', X_val.shape,'\nTest dataset shape: ', X_test.shape)
print("\nPercentage of classes in Train dataset: ",y_train.value_counts(normalize=True))
print("\nPercentage of classes in Validation dataset: ",y_val.value_counts(normalize=True))
print("\nPercentage of classes in Test dataset: ",y_test.value_counts(normalize=True))
Train dataset shape:  (7216, 24) 
Validation dataset shape:  (2406, 24) 
Test dataset shape:  (2406, 24)

Percentage of classes in Train dataset:  Churn
No    0.718
Yes   0.282
Name: proportion, dtype: float64

Percentage of classes in Validation dataset:  Churn
No    0.718
Yes   0.282
Name: proportion, dtype: float64

Percentage of classes in Test dataset:  Churn
No    0.718
Yes   0.282
Name: proportion, dtype: float64

Feature Encoding (categorical variables)¶

In [54]:
#Using OneHotEncoder
cat_cols = X_train.select_dtypes(include=['category']).columns.tolist()

# Initialize OneHotEncoder
# drop='first' avoids multicollinearity
# handle_unknown='ignore'safe if new category appears in val/test
ohe = OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore')

# Fit encoder only on training categorical data
ohe.fit(X_train[cat_cols])

# Transform all three datasets using the same encoder
encoded_train = ohe.transform(X_train[cat_cols])
encoded_val = ohe.transform(X_val[cat_cols])
encoded_test = ohe.transform(X_test[cat_cols])

# Convert encoded arrays to DataFrames with proper column names
encoded_train_df = pd.DataFrame(encoded_train, columns=ohe.get_feature_names_out(cat_cols), index=X_train.index)
encoded_val_df = pd.DataFrame(encoded_val, columns=ohe.get_feature_names_out(cat_cols), index=X_val.index)
encoded_test_df = pd.DataFrame(encoded_test, columns=ohe.get_feature_names_out(cat_cols), index=X_test.index)

# Drop original categorical columns and concatenate encoded ones
X_train_encoded = pd.concat([X_train.drop(columns=cat_cols), encoded_train_df], axis=1)
X_val_encoded = pd.concat([X_val.drop(columns=cat_cols), encoded_val_df], axis=1)
X_test_encoded = pd.concat([X_test.drop(columns=cat_cols), encoded_test_df], axis=1)

# Verify consistency
print('Encoding Complete!')
print(f"Train shape: {X_train_encoded.shape}")
print(f"Validation shape: {X_val_encoded.shape}")
print(f"Test shape: {X_test_encoded.shape}")

# Check all columns match
assert list(X_train_encoded.columns) == list(X_val_encoded.columns) == list(X_test_encoded.columns), "Encoded columns are misaligned!"
print("Success: Encoded columns are perfectly aligned across all datasets.")
Encoding Complete!
Train shape: (7216, 48)
Validation shape: (2406, 48)
Test shape: (2406, 48)
Success: Encoded columns are perfectly aligned across all datasets.

Feature Scaling (numerical variables)¶

In [55]:
# Identify numeric columns
num_cols = ['MonthlyCharges', 'Internet_AddOnCount', 'AvgMonthlySpend', 
            'BillingRatio', 'RelativeSpend', 'CostDeviation']

# Initialize scaler
scaler = RobustScaler()

# Create copies for scaled datasets
X_train_scaled = X_train_encoded.copy()
X_val_scaled   = X_val_encoded.copy()
X_test_scaled  = X_test_encoded.copy()

# Fit scaler only on training data (no leakage)
X_train_scaled[num_cols] = scaler.fit_transform(X_train_encoded[num_cols])

# Transform validation and test sets using the same scaler
X_val_scaled[num_cols] = scaler.transform(X_val_encoded[num_cols])
X_test_scaled[num_cols] = scaler.transform(X_test_encoded[num_cols])

# Verify scaling
print("Feature scaling (RobustScaler) complete.")
print(f"Scaled columns: {num_cols}")
print(f"Train shape: {X_train_scaled.shape}, Validation: {X_val_scaled.shape}, Test: {X_test_scaled.shape}")
Feature scaling (RobustScaler) complete.
Scaled columns: ['MonthlyCharges', 'Internet_AddOnCount', 'AvgMonthlySpend', 'BillingRatio', 'RelativeSpend', 'CostDeviation']
Train shape: (7216, 48), Validation: (2406, 48), Test: (2406, 48)
In [56]:
X_train_scaled.head().T
Out[56]:
10248 9664 11378 5957 6615
MonthlyCharges 0.288 0.616 0.347 -0.809 0.398
Internet_AddOnCount -0.333 -0.333 -0.333 -0.667 0.333
AvgMonthlySpend -0.913 1.025 0.367 -0.652 0.269
BillingRatio -5.177 1.975 0.391 0.095 -0.213
RelativeSpend 50.354 -1.158 -0.276 -0.057 0.197
CostDeviation 0.164 1.494 0.402 -2.152 -0.151
gender_Male 1.000 1.000 1.000 1.000 0.000
SeniorCitizen_Yes 0.000 1.000 0.000 1.000 0.000
Partner_Yes 1.000 0.000 0.000 0.000 0.000
Dependents_Yes 0.000 0.000 0.000 0.000 0.000
InternetService_Fiber optic 1.000 1.000 1.000 0.000 1.000
InternetService_No 0.000 0.000 0.000 0.000 0.000
OnlineSecurity_No internet service 0.000 0.000 0.000 0.000 0.000
OnlineSecurity_Yes 0.000 0.000 0.000 0.000 0.000
OnlineBackup_No internet service 0.000 0.000 0.000 0.000 0.000
OnlineBackup_Yes 0.000 0.000 0.000 0.000 1.000
DeviceProtection_No internet service 0.000 0.000 0.000 0.000 0.000
DeviceProtection_Yes 1.000 0.000 0.000 0.000 0.000
TechSupport_No internet service 0.000 0.000 0.000 0.000 0.000
TechSupport_Yes 0.000 0.000 0.000 0.000 0.000
StreamingTV_No internet service 0.000 0.000 0.000 0.000 0.000
StreamingTV_Yes 0.000 0.000 0.000 0.000 1.000
StreamingMovies_No internet service 0.000 0.000 0.000 0.000 0.000
StreamingMovies_Yes 0.000 1.000 1.000 0.000 1.000
Contract_One year 0.000 0.000 0.000 0.000 1.000
Contract_Two year 0.000 0.000 0.000 0.000 0.000
PaperlessBilling_Yes 1.000 0.000 1.000 1.000 1.000
PaymentMethod_Credit card (automatic) 0.000 0.000 0.000 0.000 0.000
PaymentMethod_Electronic check 1.000 1.000 1.000 0.000 0.000
PaymentMethod_Mailed check 0.000 0.000 0.000 0.000 1.000
PhoneServiceStatus_No phone service 0.000 0.000 0.000 1.000 0.000
PhoneServiceStatus_Single phone line 1.000 0.000 0.000 0.000 1.000
IsNewCustomer_Yes 0.000 0.000 0.000 0.000 0.000
TenureGroup_13–24m 0.000 0.000 0.000 0.000 0.000
TenureGroup_25–48m 0.000 1.000 1.000 0.000 0.000
TenureGroup_49m+ 0.000 0.000 0.000 0.000 0.000
TenureGroup_7–12m 1.000 0.000 0.000 0.000 1.000
ContractPaymentCombo_Month-to-month_Credit card (automatic) 0.000 0.000 0.000 0.000 0.000
ContractPaymentCombo_Month-to-month_Electronic check 1.000 1.000 1.000 0.000 0.000
ContractPaymentCombo_Month-to-month_Mailed check 0.000 0.000 0.000 0.000 0.000
ContractPaymentCombo_One year_Bank transfer (automatic) 0.000 0.000 0.000 0.000 0.000
ContractPaymentCombo_One year_Credit card (automatic) 0.000 0.000 0.000 0.000 0.000
ContractPaymentCombo_One year_Electronic check 0.000 0.000 0.000 0.000 0.000
ContractPaymentCombo_One year_Mailed check 0.000 0.000 0.000 0.000 1.000
ContractPaymentCombo_Two year_Bank transfer (automatic) 0.000 0.000 0.000 0.000 0.000
ContractPaymentCombo_Two year_Credit card (automatic) 0.000 0.000 0.000 0.000 0.000
ContractPaymentCombo_Two year_Electronic check 0.000 0.000 0.000 0.000 0.000
ContractPaymentCombo_Two year_Mailed check 0.000 0.000 0.000 0.000 0.000

Class Imbalance Handling¶

In [57]:
# Check class balance in training labels
print("Class distribution in training data:")
print(y_train.value_counts())
print("\nPercentage distribution:")
print(y_train.value_counts(normalize=True) * 100)

# Optional: visualize
plt.figure(figsize=(5,4))
sns.countplot(x=y_train, palette='coolwarm')
plt.title("Class Distribution - Training Data", fontsize=12, fontweight='bold')
plt.xlabel("Churn")
plt.ylabel("Count")
plt.show()
Class distribution in training data:
Churn
No     5180
Yes    2036
Name: count, dtype: int64

Percentage distribution:
Churn
No    71.785
Yes   28.215
Name: proportion, dtype: float64
No description has been provided for this image
In [58]:
#Apply SMOTE
sm = SMOTE(random_state=42, sampling_strategy='auto', k_neighbors=5)
X_train_bal, y_train_bal = sm.fit_resample(X_train_scaled, y_train)

print("\nAfter SMOTE:")
print(y_train_bal.value_counts())
print("\nPercentage after balancing:")
print(y_train_bal.value_counts(normalize=True) * 100)

plt.figure(figsize=(5,4))
sns.countplot(x=y_train_bal, palette='crest')
plt.title("Post-SMOTE Class Distribution", fontsize=12, fontweight='bold')
plt.xlabel("Churn")
plt.ylabel("Count")
plt.show()
After SMOTE:
Churn
No     5180
Yes    5180
Name: count, dtype: int64

Percentage after balancing:
Churn
No    50.000
Yes   50.000
Name: proportion, dtype: float64
No description has been provided for this image
In [59]:
# Create final copies for modeling
X_train_final = X_train_bal.copy()
y_train_final = y_train_bal.copy()

X_val_final = X_val_scaled.copy()
y_val_final = y_val.copy()

X_test_final = X_test_scaled.copy()
y_test_final = y_test.copy()

print("Final datasets for modeling created successfully!")

# sanity checks
print(f"Train: {X_train_final.shape}, Validation: {X_val_final.shape}, Test: {X_test_final.shape}")
print(f"Train target distribution:\n{y_train_final.value_counts(normalize=True)*100}\n")
print(f"Validation target distribution:\n{y_val_final.value_counts(normalize=True)*100}\n")
print(f"Test target distribution:\n{y_test_final.value_counts(normalize=True)*100}\n")
Final datasets for modeling created successfully!
Train: (10360, 48), Validation: (2406, 48), Test: (2406, 48)
Train target distribution:
Churn
No    50.000
Yes   50.000
Name: proportion, dtype: float64

Validation target distribution:
Churn
No    71.779
Yes   28.221
Name: proportion, dtype: float64

Test target distribution:
Churn
No    71.779
Yes   28.221
Name: proportion, dtype: float64

In [60]:
X_train_final.to_csv("X_train_final.csv", index=False,encoding="utf-8-sig")
y_train_final.to_csv("y_train_final.csv", index=False,encoding="utf-8-sig")

Model Building – Baseline Model¶

Model evaluation criterion¶

Primary Metric: Recall

  • Business perspective: In churn prediction, missing a true churner (false negative) means losing a paying customer.
  • Operational impact: It’s better to flag a few loyal customers by mistake (false positives) than to miss an actual churner
  • Retention strategy: High recall ensures AlphaCom reaches most customers who are actually about to churn — maximizing the effectiveness of retention campaigns.
  • Goal: Maximize Recall (ideally >80%), even if precision slightly decreases.

Seondary Metric: F1-Score

  • Improving Precision indicates how many customers predicted as churners actually churned, which helps avoid wasted marketing resources (retention offers to loyal customers).
  • F1 Score balances Precision and Recall into one metric.
  • Useful when both missed churners (FN) and false alarms (FP) have business cost.

Functions to output different metrics (including recall) & to show confusion matrix¶

In [61]:
def model_performance_classification(model, predictors, target, threshold = 0.5):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    prob_pred = model.predict(predictors)
    class_pred = [1 if i >= threshold else 0 for i in prob_pred]

    acc = accuracy_score(target, class_pred)  # to compute Accuracy
    recall = recall_score(target, class_pred)  # to compute Recall
    precision = precision_score(target, class_pred)  # to compute Precision
    f1 = f1_score(target, class_pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
In [62]:
def plot_confusion_matrix(model, predictors, target, threshold = 0.5):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    prob_pred = model.predict(predictors)
    class_pred = [1 if i >= threshold else 0 for i in prob_pred]
    cm = confusion_matrix(target, class_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
In [63]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
In [64]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

Model Building¶

Model 1 - Logistic Regression¶

In [65]:
# Define Function to calculate VIF
def calculate_vif(idf):
    """
    Calculate Variance Inflation Factor (VIF) for each variable in a DataFrame.

    Parameters:
    df (DataFrame): Input DataFrame containing numerical variables.

    Returns:
    vif_df (DataFrame): DataFrame containing variable names and their corresponding VIF values.
    """
    variables = idf.values
    vif_df = pd.DataFrame()
    vif_df["Variable"] = idf.columns
    vif_df["VIF"] = [variance_inflation_factor(variables, i) for i in range(idf.shape[1])]
    return vif_df
In [66]:
# Checking Multicollinearity
le = LabelEncoder()
y_train_m1 = le.fit_transform(y_train_final)
y_val_m1 = le.transform(y_val_final)
y_test_m1 = le.transform(y_test_final)

X_train_m1 = X_train_final.copy()
X_val_m1 = X_val_final.copy()
X_test_m1 = X_test_final.copy()

# Call the function to calculate VIF
vif_result = calculate_vif(X_train_m1)
print("Variance Inflation Factors:")
print(vif_result)
Variance Inflation Factors:
                                             Variable     VIF
0                                      MonthlyCharges 186.402
1                                 Internet_AddOnCount  61.586
2                                     AvgMonthlySpend   4.037
3                                        BillingRatio   4.059
4                                       RelativeSpend   1.005
5                                       CostDeviation  39.326
6                                         gender_Male   1.015
7                                   SeniorCitizen_Yes   1.100
8                                         Partner_Yes   1.344
9                                      Dependents_Yes   1.265
10                        InternetService_Fiber optic  48.744
11                                 InternetService_No     inf
12                 OnlineSecurity_No internet service     inf
13                                 OnlineSecurity_Yes   4.794
14                   OnlineBackup_No internet service     inf
15                                   OnlineBackup_Yes   6.102
16               DeviceProtection_No internet service     inf
17                               DeviceProtection_Yes   8.563
18                    TechSupport_No internet service     inf
19                                    TechSupport_Yes   4.901
20                    StreamingTV_No internet service     inf
21                                    StreamingTV_Yes   8.857
22                StreamingMovies_No internet service     inf
23                                StreamingMovies_Yes   8.725
24                                  Contract_One year     inf
25                                  Contract_Two year     inf
26                               PaperlessBilling_Yes   1.243
27              PaymentMethod_Credit card (automatic)     inf
28                     PaymentMethod_Electronic check     inf
29                         PaymentMethod_Mailed check     inf
30                PhoneServiceStatus_No phone service   1.923
31               PhoneServiceStatus_Single phone line   1.470
32                                  IsNewCustomer_Yes   1.146
33                                 TenureGroup_13–24m   1.536
34                                 TenureGroup_25–48m   2.111
35                                   TenureGroup_49m+   3.237
36                                  TenureGroup_7–12m   1.300
37  ContractPaymentCombo_Month-to-month_Credit car...     inf
38  ContractPaymentCombo_Month-to-month_Electronic...     inf
39   ContractPaymentCombo_Month-to-month_Mailed check     inf
40  ContractPaymentCombo_One year_Bank transfer (a...     inf
41  ContractPaymentCombo_One year_Credit card (aut...     inf
42     ContractPaymentCombo_One year_Electronic check     inf
43         ContractPaymentCombo_One year_Mailed check     inf
44  ContractPaymentCombo_Two year_Bank transfer (a...     inf
45  ContractPaymentCombo_Two year_Credit card (aut...     inf
46     ContractPaymentCombo_Two year_Electronic check     inf
47         ContractPaymentCombo_Two year_Mailed check     inf
In [67]:
%%time
# Dropping columns with VIF > 5 iteratively
i_vif = 0
while vif_result['VIF'].max() > 5: 
     high_vif_column = vif_result.loc[vif_result['VIF'].idxmax(), 'Variable']
     print("\nDropping ",high_vif_column," due to high VIF")
     X_train_m1.drop(columns=high_vif_column, inplace=True)
     X_val_m1.drop(columns=high_vif_column, inplace=True)
     X_test_m1.drop(columns=high_vif_column, inplace=True)
     vif_result = calculate_vif(X_train_m1)
     i_vif = i_vif + 1
     print('\nIteration: ',i_vif,' | \n','VIF Result',vif_result)
print('\nVIF Check Complete')
Dropping  InternetService_No  due to high VIF

Iteration:  1  | 
 VIF Result                                              Variable     VIF
0                                      MonthlyCharges 186.402
1                                 Internet_AddOnCount  61.586
2                                     AvgMonthlySpend   4.037
3                                        BillingRatio   4.059
4                                       RelativeSpend   1.005
5                                       CostDeviation  39.326
6                                         gender_Male   1.015
7                                   SeniorCitizen_Yes   1.100
8                                         Partner_Yes   1.344
9                                      Dependents_Yes   1.265
10                        InternetService_Fiber optic  48.744
11                 OnlineSecurity_No internet service     inf
12                                 OnlineSecurity_Yes   4.794
13                   OnlineBackup_No internet service     inf
14                                   OnlineBackup_Yes   6.102
15               DeviceProtection_No internet service     inf
16                               DeviceProtection_Yes   8.563
17                    TechSupport_No internet service     inf
18                                    TechSupport_Yes   4.901
19                    StreamingTV_No internet service     inf
20                                    StreamingTV_Yes   8.857
21                StreamingMovies_No internet service     inf
22                                StreamingMovies_Yes   8.725
23                                  Contract_One year     inf
24                                  Contract_Two year     inf
25                               PaperlessBilling_Yes   1.243
26              PaymentMethod_Credit card (automatic)     inf
27                     PaymentMethod_Electronic check     inf
28                         PaymentMethod_Mailed check     inf
29                PhoneServiceStatus_No phone service   1.923
30               PhoneServiceStatus_Single phone line   1.470
31                                  IsNewCustomer_Yes   1.146
32                                 TenureGroup_13–24m   1.536
33                                 TenureGroup_25–48m   2.111
34                                   TenureGroup_49m+   3.237
35                                  TenureGroup_7–12m   1.300
36  ContractPaymentCombo_Month-to-month_Credit car...     inf
37  ContractPaymentCombo_Month-to-month_Electronic...     inf
38   ContractPaymentCombo_Month-to-month_Mailed check     inf
39  ContractPaymentCombo_One year_Bank transfer (a...     inf
40  ContractPaymentCombo_One year_Credit card (aut...     inf
41     ContractPaymentCombo_One year_Electronic check     inf
42         ContractPaymentCombo_One year_Mailed check     inf
43  ContractPaymentCombo_Two year_Bank transfer (a...     inf
44  ContractPaymentCombo_Two year_Credit card (aut...     inf
45     ContractPaymentCombo_Two year_Electronic check     inf
46         ContractPaymentCombo_Two year_Mailed check     inf

Dropping  OnlineSecurity_No internet service  due to high VIF

Iteration:  2  | 
 VIF Result                                              Variable     VIF
0                                      MonthlyCharges 186.402
1                                 Internet_AddOnCount  61.586
2                                     AvgMonthlySpend   4.037
3                                        BillingRatio   4.059
4                                       RelativeSpend   1.005
5                                       CostDeviation  39.326
6                                         gender_Male   1.015
7                                   SeniorCitizen_Yes   1.100
8                                         Partner_Yes   1.344
9                                      Dependents_Yes   1.265
10                        InternetService_Fiber optic  48.744
11                                 OnlineSecurity_Yes   4.794
12                   OnlineBackup_No internet service     inf
13                                   OnlineBackup_Yes   6.102
14               DeviceProtection_No internet service     inf
15                               DeviceProtection_Yes   8.563
16                    TechSupport_No internet service     inf
17                                    TechSupport_Yes   4.901
18                    StreamingTV_No internet service     inf
19                                    StreamingTV_Yes   8.857
20                StreamingMovies_No internet service     inf
21                                StreamingMovies_Yes   8.725
22                                  Contract_One year     inf
23                                  Contract_Two year     inf
24                               PaperlessBilling_Yes   1.243
25              PaymentMethod_Credit card (automatic)     inf
26                     PaymentMethod_Electronic check     inf
27                         PaymentMethod_Mailed check     inf
28                PhoneServiceStatus_No phone service   1.923
29               PhoneServiceStatus_Single phone line   1.470
30                                  IsNewCustomer_Yes   1.146
31                                 TenureGroup_13–24m   1.536
32                                 TenureGroup_25–48m   2.111
33                                   TenureGroup_49m+   3.237
34                                  TenureGroup_7–12m   1.300
35  ContractPaymentCombo_Month-to-month_Credit car...     inf
36  ContractPaymentCombo_Month-to-month_Electronic...     inf
37   ContractPaymentCombo_Month-to-month_Mailed check     inf
38  ContractPaymentCombo_One year_Bank transfer (a...     inf
39  ContractPaymentCombo_One year_Credit card (aut...     inf
40     ContractPaymentCombo_One year_Electronic check     inf
41         ContractPaymentCombo_One year_Mailed check     inf
42  ContractPaymentCombo_Two year_Bank transfer (a...     inf
43  ContractPaymentCombo_Two year_Credit card (aut...     inf
44     ContractPaymentCombo_Two year_Electronic check     inf
45         ContractPaymentCombo_Two year_Mailed check     inf

Dropping  OnlineBackup_No internet service  due to high VIF

Iteration:  3  | 
 VIF Result                                              Variable     VIF
0                                      MonthlyCharges 186.402
1                                 Internet_AddOnCount  61.586
2                                     AvgMonthlySpend   4.037
3                                        BillingRatio   4.059
4                                       RelativeSpend   1.005
5                                       CostDeviation  39.326
6                                         gender_Male   1.015
7                                   SeniorCitizen_Yes   1.100
8                                         Partner_Yes   1.344
9                                      Dependents_Yes   1.265
10                        InternetService_Fiber optic  48.744
11                                 OnlineSecurity_Yes   4.794
12                                   OnlineBackup_Yes   6.102
13               DeviceProtection_No internet service     inf
14                               DeviceProtection_Yes   8.563
15                    TechSupport_No internet service     inf
16                                    TechSupport_Yes   4.901
17                    StreamingTV_No internet service     inf
18                                    StreamingTV_Yes   8.857
19                StreamingMovies_No internet service     inf
20                                StreamingMovies_Yes   8.725
21                                  Contract_One year     inf
22                                  Contract_Two year     inf
23                               PaperlessBilling_Yes   1.243
24              PaymentMethod_Credit card (automatic)     inf
25                     PaymentMethod_Electronic check     inf
26                         PaymentMethod_Mailed check     inf
27                PhoneServiceStatus_No phone service   1.923
28               PhoneServiceStatus_Single phone line   1.470
29                                  IsNewCustomer_Yes   1.146
30                                 TenureGroup_13–24m   1.536
31                                 TenureGroup_25–48m   2.111
32                                   TenureGroup_49m+   3.237
33                                  TenureGroup_7–12m   1.300
34  ContractPaymentCombo_Month-to-month_Credit car...     inf
35  ContractPaymentCombo_Month-to-month_Electronic...     inf
36   ContractPaymentCombo_Month-to-month_Mailed check     inf
37  ContractPaymentCombo_One year_Bank transfer (a...     inf
38  ContractPaymentCombo_One year_Credit card (aut...     inf
39     ContractPaymentCombo_One year_Electronic check     inf
40         ContractPaymentCombo_One year_Mailed check     inf
41  ContractPaymentCombo_Two year_Bank transfer (a...     inf
42  ContractPaymentCombo_Two year_Credit card (aut...     inf
43     ContractPaymentCombo_Two year_Electronic check     inf
44         ContractPaymentCombo_Two year_Mailed check     inf

Dropping  DeviceProtection_No internet service  due to high VIF

Iteration:  4  | 
 VIF Result                                              Variable     VIF
0                                      MonthlyCharges 186.402
1                                 Internet_AddOnCount  61.586
2                                     AvgMonthlySpend   4.037
3                                        BillingRatio   4.059
4                                       RelativeSpend   1.005
5                                       CostDeviation  39.326
6                                         gender_Male   1.015
7                                   SeniorCitizen_Yes   1.100
8                                         Partner_Yes   1.344
9                                      Dependents_Yes   1.265
10                        InternetService_Fiber optic  48.744
11                                 OnlineSecurity_Yes   4.794
12                                   OnlineBackup_Yes   6.102
13                               DeviceProtection_Yes   8.563
14                    TechSupport_No internet service     inf
15                                    TechSupport_Yes   4.901
16                    StreamingTV_No internet service     inf
17                                    StreamingTV_Yes   8.857
18                StreamingMovies_No internet service     inf
19                                StreamingMovies_Yes   8.725
20                                  Contract_One year     inf
21                                  Contract_Two year     inf
22                               PaperlessBilling_Yes   1.243
23              PaymentMethod_Credit card (automatic)     inf
24                     PaymentMethod_Electronic check     inf
25                         PaymentMethod_Mailed check     inf
26                PhoneServiceStatus_No phone service   1.923
27               PhoneServiceStatus_Single phone line   1.470
28                                  IsNewCustomer_Yes   1.146
29                                 TenureGroup_13–24m   1.536
30                                 TenureGroup_25–48m   2.111
31                                   TenureGroup_49m+   3.237
32                                  TenureGroup_7–12m   1.300
33  ContractPaymentCombo_Month-to-month_Credit car...     inf
34  ContractPaymentCombo_Month-to-month_Electronic...     inf
35   ContractPaymentCombo_Month-to-month_Mailed check     inf
36  ContractPaymentCombo_One year_Bank transfer (a...     inf
37  ContractPaymentCombo_One year_Credit card (aut...     inf
38     ContractPaymentCombo_One year_Electronic check     inf
39         ContractPaymentCombo_One year_Mailed check     inf
40  ContractPaymentCombo_Two year_Bank transfer (a...     inf
41  ContractPaymentCombo_Two year_Credit card (aut...     inf
42     ContractPaymentCombo_Two year_Electronic check     inf
43         ContractPaymentCombo_Two year_Mailed check     inf

Dropping  TechSupport_No internet service  due to high VIF

Iteration:  5  | 
 VIF Result                                              Variable     VIF
0                                      MonthlyCharges 186.402
1                                 Internet_AddOnCount  61.586
2                                     AvgMonthlySpend   4.037
3                                        BillingRatio   4.059
4                                       RelativeSpend   1.005
5                                       CostDeviation  39.326
6                                         gender_Male   1.015
7                                   SeniorCitizen_Yes   1.100
8                                         Partner_Yes   1.344
9                                      Dependents_Yes   1.265
10                        InternetService_Fiber optic  48.744
11                                 OnlineSecurity_Yes   4.794
12                                   OnlineBackup_Yes   6.102
13                               DeviceProtection_Yes   8.563
14                                    TechSupport_Yes   4.901
15                    StreamingTV_No internet service     inf
16                                    StreamingTV_Yes   8.857
17                StreamingMovies_No internet service     inf
18                                StreamingMovies_Yes   8.725
19                                  Contract_One year     inf
20                                  Contract_Two year     inf
21                               PaperlessBilling_Yes   1.243
22              PaymentMethod_Credit card (automatic)     inf
23                     PaymentMethod_Electronic check     inf
24                         PaymentMethod_Mailed check     inf
25                PhoneServiceStatus_No phone service   1.923
26               PhoneServiceStatus_Single phone line   1.470
27                                  IsNewCustomer_Yes   1.146
28                                 TenureGroup_13–24m   1.536
29                                 TenureGroup_25–48m   2.111
30                                   TenureGroup_49m+   3.237
31                                  TenureGroup_7–12m   1.300
32  ContractPaymentCombo_Month-to-month_Credit car...     inf
33  ContractPaymentCombo_Month-to-month_Electronic...     inf
34   ContractPaymentCombo_Month-to-month_Mailed check     inf
35  ContractPaymentCombo_One year_Bank transfer (a...     inf
36  ContractPaymentCombo_One year_Credit card (aut...     inf
37     ContractPaymentCombo_One year_Electronic check     inf
38         ContractPaymentCombo_One year_Mailed check     inf
39  ContractPaymentCombo_Two year_Bank transfer (a...     inf
40  ContractPaymentCombo_Two year_Credit card (aut...     inf
41     ContractPaymentCombo_Two year_Electronic check     inf
42         ContractPaymentCombo_Two year_Mailed check     inf

Dropping  StreamingTV_No internet service  due to high VIF

Iteration:  6  | 
 VIF Result                                              Variable     VIF
0                                      MonthlyCharges 186.402
1                                 Internet_AddOnCount  61.586
2                                     AvgMonthlySpend   4.037
3                                        BillingRatio   4.059
4                                       RelativeSpend   1.005
5                                       CostDeviation  39.326
6                                         gender_Male   1.015
7                                   SeniorCitizen_Yes   1.100
8                                         Partner_Yes   1.344
9                                      Dependents_Yes   1.265
10                        InternetService_Fiber optic  48.744
11                                 OnlineSecurity_Yes   4.794
12                                   OnlineBackup_Yes   6.102
13                               DeviceProtection_Yes   8.563
14                                    TechSupport_Yes   4.901
15                                    StreamingTV_Yes   8.857
16                StreamingMovies_No internet service  58.263
17                                StreamingMovies_Yes   8.725
18                                  Contract_One year     inf
19                                  Contract_Two year     inf
20                               PaperlessBilling_Yes   1.243
21              PaymentMethod_Credit card (automatic)     inf
22                     PaymentMethod_Electronic check     inf
23                         PaymentMethod_Mailed check     inf
24                PhoneServiceStatus_No phone service   1.923
25               PhoneServiceStatus_Single phone line   1.470
26                                  IsNewCustomer_Yes   1.146
27                                 TenureGroup_13–24m   1.536
28                                 TenureGroup_25–48m   2.111
29                                   TenureGroup_49m+   3.237
30                                  TenureGroup_7–12m   1.300
31  ContractPaymentCombo_Month-to-month_Credit car...     inf
32  ContractPaymentCombo_Month-to-month_Electronic...     inf
33   ContractPaymentCombo_Month-to-month_Mailed check     inf
34  ContractPaymentCombo_One year_Bank transfer (a...     inf
35  ContractPaymentCombo_One year_Credit card (aut...     inf
36     ContractPaymentCombo_One year_Electronic check     inf
37         ContractPaymentCombo_One year_Mailed check     inf
38  ContractPaymentCombo_Two year_Bank transfer (a...     inf
39  ContractPaymentCombo_Two year_Credit card (aut...     inf
40     ContractPaymentCombo_Two year_Electronic check     inf
41         ContractPaymentCombo_Two year_Mailed check     inf

Dropping  Contract_One year  due to high VIF

Iteration:  7  | 
 VIF Result                                              Variable     VIF
0                                      MonthlyCharges 186.402
1                                 Internet_AddOnCount  61.586
2                                     AvgMonthlySpend   4.037
3                                        BillingRatio   4.059
4                                       RelativeSpend   1.005
5                                       CostDeviation  39.326
6                                         gender_Male   1.015
7                                   SeniorCitizen_Yes   1.100
8                                         Partner_Yes   1.344
9                                      Dependents_Yes   1.265
10                        InternetService_Fiber optic  48.744
11                                 OnlineSecurity_Yes   4.794
12                                   OnlineBackup_Yes   6.102
13                               DeviceProtection_Yes   8.563
14                                    TechSupport_Yes   4.901
15                                    StreamingTV_Yes   8.857
16                StreamingMovies_No internet service  58.263
17                                StreamingMovies_Yes   8.725
18                                  Contract_Two year     inf
19                               PaperlessBilling_Yes   1.243
20              PaymentMethod_Credit card (automatic)     inf
21                     PaymentMethod_Electronic check     inf
22                         PaymentMethod_Mailed check     inf
23                PhoneServiceStatus_No phone service   1.923
24               PhoneServiceStatus_Single phone line   1.470
25                                  IsNewCustomer_Yes   1.146
26                                 TenureGroup_13–24m   1.536
27                                 TenureGroup_25–48m   2.111
28                                   TenureGroup_49m+   3.237
29                                  TenureGroup_7–12m   1.300
30  ContractPaymentCombo_Month-to-month_Credit car...     inf
31  ContractPaymentCombo_Month-to-month_Electronic...     inf
32   ContractPaymentCombo_Month-to-month_Mailed check     inf
33  ContractPaymentCombo_One year_Bank transfer (a...   2.148
34  ContractPaymentCombo_One year_Credit card (aut...     inf
35     ContractPaymentCombo_One year_Electronic check     inf
36         ContractPaymentCombo_One year_Mailed check     inf
37  ContractPaymentCombo_Two year_Bank transfer (a...     inf
38  ContractPaymentCombo_Two year_Credit card (aut...     inf
39     ContractPaymentCombo_Two year_Electronic check     inf
40         ContractPaymentCombo_Two year_Mailed check     inf

Dropping  Contract_Two year  due to high VIF

Iteration:  8  | 
 VIF Result                                              Variable     VIF
0                                      MonthlyCharges 186.402
1                                 Internet_AddOnCount  61.586
2                                     AvgMonthlySpend   4.037
3                                        BillingRatio   4.059
4                                       RelativeSpend   1.005
5                                       CostDeviation  39.326
6                                         gender_Male   1.015
7                                   SeniorCitizen_Yes   1.100
8                                         Partner_Yes   1.344
9                                      Dependents_Yes   1.265
10                        InternetService_Fiber optic  48.744
11                                 OnlineSecurity_Yes   4.794
12                                   OnlineBackup_Yes   6.102
13                               DeviceProtection_Yes   8.563
14                                    TechSupport_Yes   4.901
15                                    StreamingTV_Yes   8.857
16                StreamingMovies_No internet service  58.263
17                                StreamingMovies_Yes   8.725
18                               PaperlessBilling_Yes   1.243
19              PaymentMethod_Credit card (automatic)     inf
20                     PaymentMethod_Electronic check     inf
21                         PaymentMethod_Mailed check     inf
22                PhoneServiceStatus_No phone service   1.923
23               PhoneServiceStatus_Single phone line   1.470
24                                  IsNewCustomer_Yes   1.146
25                                 TenureGroup_13–24m   1.536
26                                 TenureGroup_25–48m   2.111
27                                   TenureGroup_49m+   3.237
28                                  TenureGroup_7–12m   1.300
29  ContractPaymentCombo_Month-to-month_Credit car...     inf
30  ContractPaymentCombo_Month-to-month_Electronic...     inf
31   ContractPaymentCombo_Month-to-month_Mailed check     inf
32  ContractPaymentCombo_One year_Bank transfer (a...   2.148
33  ContractPaymentCombo_One year_Credit card (aut...     inf
34     ContractPaymentCombo_One year_Electronic check     inf
35         ContractPaymentCombo_One year_Mailed check     inf
36  ContractPaymentCombo_Two year_Bank transfer (a...   3.534
37  ContractPaymentCombo_Two year_Credit card (aut...     inf
38     ContractPaymentCombo_Two year_Electronic check     inf
39         ContractPaymentCombo_Two year_Mailed check     inf

Dropping  PaymentMethod_Credit card (automatic)  due to high VIF

Iteration:  9  | 
 VIF Result                                              Variable     VIF
0                                      MonthlyCharges 186.402
1                                 Internet_AddOnCount  61.586
2                                     AvgMonthlySpend   4.037
3                                        BillingRatio   4.059
4                                       RelativeSpend   1.005
5                                       CostDeviation  39.326
6                                         gender_Male   1.015
7                                   SeniorCitizen_Yes   1.100
8                                         Partner_Yes   1.344
9                                      Dependents_Yes   1.265
10                        InternetService_Fiber optic  48.744
11                                 OnlineSecurity_Yes   4.794
12                                   OnlineBackup_Yes   6.102
13                               DeviceProtection_Yes   8.563
14                                    TechSupport_Yes   4.901
15                                    StreamingTV_Yes   8.857
16                StreamingMovies_No internet service  58.263
17                                StreamingMovies_Yes   8.725
18                               PaperlessBilling_Yes   1.243
19                     PaymentMethod_Electronic check     inf
20                         PaymentMethod_Mailed check     inf
21                PhoneServiceStatus_No phone service   1.923
22               PhoneServiceStatus_Single phone line   1.470
23                                  IsNewCustomer_Yes   1.146
24                                 TenureGroup_13–24m   1.536
25                                 TenureGroup_25–48m   2.111
26                                   TenureGroup_49m+   3.237
27                                  TenureGroup_7–12m   1.300
28  ContractPaymentCombo_Month-to-month_Credit car...   1.940
29  ContractPaymentCombo_Month-to-month_Electronic...     inf
30   ContractPaymentCombo_Month-to-month_Mailed check     inf
31  ContractPaymentCombo_One year_Bank transfer (a...   2.148
32  ContractPaymentCombo_One year_Credit card (aut...   2.555
33     ContractPaymentCombo_One year_Electronic check     inf
34         ContractPaymentCombo_One year_Mailed check     inf
35  ContractPaymentCombo_Two year_Bank transfer (a...   3.534
36  ContractPaymentCombo_Two year_Credit card (aut...   3.844
37     ContractPaymentCombo_Two year_Electronic check     inf
38         ContractPaymentCombo_Two year_Mailed check     inf

Dropping  PaymentMethod_Electronic check  due to high VIF

Iteration:  10  | 
 VIF Result                                              Variable     VIF
0                                      MonthlyCharges 186.402
1                                 Internet_AddOnCount  61.586
2                                     AvgMonthlySpend   4.037
3                                        BillingRatio   4.059
4                                       RelativeSpend   1.005
5                                       CostDeviation  39.326
6                                         gender_Male   1.015
7                                   SeniorCitizen_Yes   1.100
8                                         Partner_Yes   1.344
9                                      Dependents_Yes   1.265
10                        InternetService_Fiber optic  48.744
11                                 OnlineSecurity_Yes   4.794
12                                   OnlineBackup_Yes   6.102
13                               DeviceProtection_Yes   8.563
14                                    TechSupport_Yes   4.901
15                                    StreamingTV_Yes   8.857
16                StreamingMovies_No internet service  58.263
17                                StreamingMovies_Yes   8.725
18                               PaperlessBilling_Yes   1.243
19                         PaymentMethod_Mailed check     inf
20                PhoneServiceStatus_No phone service   1.923
21               PhoneServiceStatus_Single phone line   1.470
22                                  IsNewCustomer_Yes   1.146
23                                 TenureGroup_13–24m   1.536
24                                 TenureGroup_25–48m   2.111
25                                   TenureGroup_49m+   3.237
26                                  TenureGroup_7–12m   1.300
27  ContractPaymentCombo_Month-to-month_Credit car...   1.940
28  ContractPaymentCombo_Month-to-month_Electronic...   4.093
29   ContractPaymentCombo_Month-to-month_Mailed check     inf
30  ContractPaymentCombo_One year_Bank transfer (a...   2.148
31  ContractPaymentCombo_One year_Credit card (aut...   2.555
32     ContractPaymentCombo_One year_Electronic check   2.535
33         ContractPaymentCombo_One year_Mailed check     inf
34  ContractPaymentCombo_Two year_Bank transfer (a...   3.534
35  ContractPaymentCombo_Two year_Credit card (aut...   3.844
36     ContractPaymentCombo_Two year_Electronic check   1.919
37         ContractPaymentCombo_Two year_Mailed check     inf

Dropping  PaymentMethod_Mailed check  due to high VIF

Iteration:  11  | 
 VIF Result                                              Variable     VIF
0                                      MonthlyCharges 186.402
1                                 Internet_AddOnCount  61.586
2                                     AvgMonthlySpend   4.037
3                                        BillingRatio   4.059
4                                       RelativeSpend   1.005
5                                       CostDeviation  39.326
6                                         gender_Male   1.015
7                                   SeniorCitizen_Yes   1.100
8                                         Partner_Yes   1.344
9                                      Dependents_Yes   1.265
10                        InternetService_Fiber optic  48.744
11                                 OnlineSecurity_Yes   4.794
12                                   OnlineBackup_Yes   6.102
13                               DeviceProtection_Yes   8.563
14                                    TechSupport_Yes   4.901
15                                    StreamingTV_Yes   8.857
16                StreamingMovies_No internet service  58.263
17                                StreamingMovies_Yes   8.725
18                               PaperlessBilling_Yes   1.243
19                PhoneServiceStatus_No phone service   1.923
20               PhoneServiceStatus_Single phone line   1.470
21                                  IsNewCustomer_Yes   1.146
22                                 TenureGroup_13–24m   1.536
23                                 TenureGroup_25–48m   2.111
24                                   TenureGroup_49m+   3.237
25                                  TenureGroup_7–12m   1.300
26  ContractPaymentCombo_Month-to-month_Credit car...   1.940
27  ContractPaymentCombo_Month-to-month_Electronic...   4.093
28   ContractPaymentCombo_Month-to-month_Mailed check   2.691
29  ContractPaymentCombo_One year_Bank transfer (a...   2.148
30  ContractPaymentCombo_One year_Credit card (aut...   2.555
31     ContractPaymentCombo_One year_Electronic check   2.535
32         ContractPaymentCombo_One year_Mailed check   1.924
33  ContractPaymentCombo_Two year_Bank transfer (a...   3.534
34  ContractPaymentCombo_Two year_Credit card (aut...   3.844
35     ContractPaymentCombo_Two year_Electronic check   1.919
36         ContractPaymentCombo_Two year_Mailed check   2.290

Dropping  MonthlyCharges  due to high VIF

Iteration:  12  | 
 VIF Result                                              Variable    VIF
0                                 Internet_AddOnCount 22.266
1                                     AvgMonthlySpend  4.037
2                                        BillingRatio  4.057
3                                       RelativeSpend  1.005
4                                       CostDeviation  1.740
5                                         gender_Male  1.015
6                                   SeniorCitizen_Yes  1.100
7                                         Partner_Yes  1.344
8                                      Dependents_Yes  1.265
9                         InternetService_Fiber optic  2.374
10                                 OnlineSecurity_Yes  2.827
11                                   OnlineBackup_Yes  3.349
12                               DeviceProtection_Yes  4.519
13                                    TechSupport_Yes  3.032
14                                    StreamingTV_Yes  5.226
15                StreamingMovies_No internet service  2.653
16                                StreamingMovies_Yes  5.118
17                               PaperlessBilling_Yes  1.243
18                PhoneServiceStatus_No phone service  1.906
19               PhoneServiceStatus_Single phone line  1.470
20                                  IsNewCustomer_Yes  1.145
21                                 TenureGroup_13–24m  1.531
22                                 TenureGroup_25–48m  2.095
23                                   TenureGroup_49m+  3.234
24                                  TenureGroup_7–12m  1.297
25  ContractPaymentCombo_Month-to-month_Credit car...  1.940
26  ContractPaymentCombo_Month-to-month_Electronic...  4.091
27   ContractPaymentCombo_Month-to-month_Mailed check  2.653
28  ContractPaymentCombo_One year_Bank transfer (a...  1.607
29  ContractPaymentCombo_One year_Credit card (aut...  1.827
30     ContractPaymentCombo_One year_Electronic check  1.723
31         ContractPaymentCombo_One year_Mailed check  1.520
32  ContractPaymentCombo_Two year_Bank transfer (a...  2.117
33  ContractPaymentCombo_Two year_Credit card (aut...  2.316
34     ContractPaymentCombo_Two year_Electronic check  1.332
35         ContractPaymentCombo_Two year_Mailed check  1.699

Dropping  Internet_AddOnCount  due to high VIF

Iteration:  13  | 
 VIF Result                                              Variable   VIF
0                                     AvgMonthlySpend 4.173
1                                        BillingRatio 4.202
2                                       RelativeSpend 1.006
3                                       CostDeviation 1.741
4                                         gender_Male 2.247
5                                   SeniorCitizen_Yes 1.272
6                                         Partner_Yes 2.212
7                                      Dependents_Yes 1.659
8                         InternetService_Fiber optic 4.141
9                                  OnlineSecurity_Yes 1.692
10                                   OnlineBackup_Yes 1.689
11                               DeviceProtection_Yes 2.275
12                                    TechSupport_Yes 1.835
13                                    StreamingTV_Yes 2.676
14                StreamingMovies_No internet service 2.963
15                                StreamingMovies_Yes 2.643
16                               PaperlessBilling_Yes 2.835
17                PhoneServiceStatus_No phone service 1.890
18               PhoneServiceStatus_Single phone line 2.276
19                                  IsNewCustomer_Yes 1.160
20                                 TenureGroup_13–24m 1.666
21                                 TenureGroup_25–48m 2.465
22                                   TenureGroup_49m+ 3.991
23                                  TenureGroup_7–12m 1.368
24  ContractPaymentCombo_Month-to-month_Credit car... 1.595
25  ContractPaymentCombo_Month-to-month_Electronic... 3.793
26   ContractPaymentCombo_Month-to-month_Mailed check 1.936
27  ContractPaymentCombo_One year_Bank transfer (a... 1.472
28  ContractPaymentCombo_One year_Credit card (aut... 1.651
29     ContractPaymentCombo_One year_Electronic check 1.570
30         ContractPaymentCombo_One year_Mailed check 1.382
31  ContractPaymentCombo_Two year_Bank transfer (a... 1.950
32  ContractPaymentCombo_Two year_Credit card (aut... 2.117
33     ContractPaymentCombo_Two year_Electronic check 1.271
34         ContractPaymentCombo_Two year_Mailed check 1.576

VIF Check Complete
CPU times: total: 1min 19s
Wall time: 1min 23s
In [68]:
#adding constants before building Logistic Model

X_train_m1 = statmod.add_constant(X_train_m1, has_constant='add')
X_val_m1 = statmod.add_constant(X_val_m1, has_constant='add')
X_test_m1 = statmod.add_constant(X_test_m1, has_constant='add')

LogisticReg = statmod.Logit(y_train_m1, X_train_m1).fit()
print(LogisticReg.summary())
Optimization terminated successfully.
         Current function value: 0.449573
         Iterations 7
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                10360
Model:                          Logit   Df Residuals:                    10324
Method:                           MLE   Df Model:                           35
Date:                Sat, 25 Oct 2025   Pseudo R-squ.:                  0.3514
Time:                        16:34:02   Log-Likelihood:                -4657.6
converged:                       True   LL-Null:                       -7181.0
Covariance Type:            nonrobust   LLR p-value:                     0.000
===============================================================================================================================
                                                                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------------------------------------------------
const                                                           1.4244      0.133     10.683      0.000       1.163       1.686
AvgMonthlySpend                                                -0.0047      0.014     -0.335      0.738      -0.032       0.023
BillingRatio                                                   -0.0057      0.002     -2.585      0.010      -0.010      -0.001
RelativeSpend                                                  -0.0003      0.000     -1.305      0.192      -0.001       0.000
CostDeviation                                                   0.1506      0.035      4.290      0.000       0.082       0.219
gender_Male                                                     0.0148      0.054      0.275      0.783      -0.091       0.120
SeniorCitizen_Yes                                               0.2261      0.084      2.702      0.007       0.062       0.390
Partner_Yes                                                    -0.0635      0.063     -1.011      0.312      -0.186       0.060
Dependents_Yes                                                 -0.1865      0.071     -2.628      0.009      -0.326      -0.047
InternetService_Fiber optic                                     0.9647      0.079     12.176      0.000       0.809       1.120
OnlineSecurity_Yes                                             -0.5555      0.073     -7.595      0.000      -0.699      -0.412
OnlineBackup_Yes                                               -0.3584      0.066     -5.424      0.000      -0.488      -0.229
DeviceProtection_Yes                                           -0.1754      0.062     -2.821      0.005      -0.297      -0.054
TechSupport_Yes                                                -0.6163      0.074     -8.276      0.000      -0.762      -0.470
StreamingTV_Yes                                                 0.0676      0.072      0.935      0.350      -0.074       0.209
StreamingMovies_No internet service                            -1.2796      0.103    -12.371      0.000      -1.482      -1.077
StreamingMovies_Yes                                            -0.0746      0.071     -1.050      0.293      -0.214       0.065
PaperlessBilling_Yes                                            0.3207      0.059      5.460      0.000       0.206       0.436
PhoneServiceStatus_No phone service                            -0.0682      0.116     -0.588      0.556      -0.296       0.159
PhoneServiceStatus_Single phone line                           -0.3288      0.067     -4.900      0.000      -0.460      -0.197
IsNewCustomer_Yes                                              -0.0260      0.156     -0.167      0.867      -0.331       0.279
TenureGroup_13–24m                                             -1.3931      0.092    -15.112      0.000      -1.574      -1.212
TenureGroup_25–48m                                             -1.6445      0.094    -17.562      0.000      -1.828      -1.461
TenureGroup_49m+                                               -2.0698      0.114    -18.091      0.000      -2.294      -1.846
TenureGroup_7–12m                                              -1.0444      0.102    -10.239      0.000      -1.244      -0.844
ContractPaymentCombo_Month-to-month_Credit card (automatic)    -0.0816      0.114     -0.714      0.476      -0.306       0.143
ContractPaymentCombo_Month-to-month_Electronic check            0.4516      0.095      4.746      0.000       0.265       0.638
ContractPaymentCombo_Month-to-month_Mailed check               -0.0669      0.110     -0.609      0.543      -0.282       0.148
ContractPaymentCombo_One year_Bank transfer (automatic)        -0.6627      0.163     -4.059      0.000      -0.983      -0.343
ContractPaymentCombo_One year_Credit card (automatic)          -0.8666      0.154     -5.632      0.000      -1.168      -0.565
ContractPaymentCombo_One year_Electronic check                 -0.3807      0.138     -2.760      0.006      -0.651      -0.110
ContractPaymentCombo_One year_Mailed check                     -0.6182      0.175     -3.542      0.000      -0.960      -0.276
ContractPaymentCombo_Two year_Bank transfer (automatic)        -1.4325      0.201     -7.121      0.000      -1.827      -1.038
ContractPaymentCombo_Two year_Credit card (automatic)          -1.8014      0.209     -8.614      0.000      -2.211      -1.392
ContractPaymentCombo_Two year_Electronic check                 -1.0972      0.248     -4.430      0.000      -1.583      -0.612
ContractPaymentCombo_Two year_Mailed check                     -1.4029      0.255     -5.499      0.000      -1.903      -0.903
===============================================================================================================================
In [69]:
%%time
#Dealing with high p-value variables
m1_predictors = X_train_m1.copy()
cols = m1_predictors.columns.tolist()

# setting an initial max p-value
max_p_value = 1

while len(cols) > 0:
    # defining the train set
    X_train_m1_aux = m1_predictors[cols]

    # fitting the model
    m1_Pval = statmod.Logit(y_train_m1, X_train_m1_aux).fit()

    # getting the p-values and the maximum p-value
    p_values = m1_Pval.pvalues
    max_p_value = max(p_values)

    # name of the variable with maximum p-value
    feature_with_p_max = p_values.idxmax()

    if max_p_value > 0.05:
        cols.remove(feature_with_p_max)
        print("Dropping column ",feature_with_p_max," with p-value: ",max_p_value)
    else:
        break

selected_features = cols
print(selected_features)
Optimization terminated successfully.
         Current function value: 0.449573
         Iterations 7
Dropping column  IsNewCustomer_Yes  with p-value:  0.867495548164718
Optimization terminated successfully.
         Current function value: 0.449574
         Iterations 7
Dropping column  gender_Male  with p-value:  0.7833651267845947
Optimization terminated successfully.
         Current function value: 0.449578
         Iterations 7
Dropping column  AvgMonthlySpend  with p-value:  0.7326706402054493
Optimization terminated successfully.
         Current function value: 0.449583
         Iterations 7
Dropping column  PhoneServiceStatus_No phone service  with p-value:  0.5498272616638058
Optimization terminated successfully.
         Current function value: 0.449601
         Iterations 7
Dropping column  ContractPaymentCombo_Month-to-month_Mailed check  with p-value:  0.5388366716277079
Optimization terminated successfully.
         Current function value: 0.449619
         Iterations 7
Dropping column  ContractPaymentCombo_Month-to-month_Credit card (automatic)  with p-value:  0.6579722868342717
Optimization terminated successfully.
         Current function value: 0.449628
         Iterations 7
Dropping column  StreamingTV_Yes  with p-value:  0.37732290083021003
Optimization terminated successfully.
         Current function value: 0.449666
         Iterations 7
Dropping column  Partner_Yes  with p-value:  0.36619263294120474
Optimization terminated successfully.
         Current function value: 0.449705
         Iterations 7
Dropping column  StreamingMovies_Yes  with p-value:  0.35726002688727365
Optimization terminated successfully.
         Current function value: 0.449746
         Iterations 7
Dropping column  RelativeSpend  with p-value:  0.1956206042692712
Optimization terminated successfully.
         Current function value: 0.449834
         Iterations 7
['const', 'BillingRatio', 'CostDeviation', 'SeniorCitizen_Yes', 'Dependents_Yes', 'InternetService_Fiber optic', 'OnlineSecurity_Yes', 'OnlineBackup_Yes', 'DeviceProtection_Yes', 'TechSupport_Yes', 'StreamingMovies_No internet service', 'PaperlessBilling_Yes', 'PhoneServiceStatus_Single phone line', 'TenureGroup_13–24m', 'TenureGroup_25–48m', 'TenureGroup_49m+', 'TenureGroup_7–12m', 'ContractPaymentCombo_Month-to-month_Electronic check', 'ContractPaymentCombo_One year_Bank transfer (automatic)', 'ContractPaymentCombo_One year_Credit card (automatic)', 'ContractPaymentCombo_One year_Electronic check', 'ContractPaymentCombo_One year_Mailed check', 'ContractPaymentCombo_Two year_Bank transfer (automatic)', 'ContractPaymentCombo_Two year_Credit card (automatic)', 'ContractPaymentCombo_Two year_Electronic check', 'ContractPaymentCombo_Two year_Mailed check']
CPU times: total: 1.72 s
Wall time: 1.84 s
In [70]:
X_train_m1 = X_train_m1[selected_features]
X_val_m1 = X_val_m1[selected_features]
X_test_m1 = X_train_m1[selected_features]

m1_logit_tuned = statmod.Logit(y_train_m1,X_train_m1).fit()
print(m1_logit_tuned.summary())
Optimization terminated successfully.
         Current function value: 0.449834
         Iterations 7
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                10360
Model:                          Logit   Df Residuals:                    10334
Method:                           MLE   Df Model:                           25
Date:                Sat, 25 Oct 2025   Pseudo R-squ.:                  0.3510
Time:                        16:34:04   Log-Likelihood:                -4660.3
converged:                       True   LL-Null:                       -7181.0
Covariance Type:            nonrobust   LLR p-value:                     0.000
===========================================================================================================================
                                                              coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------------------------------------------
const                                                       1.3336      0.085     15.627      0.000       1.166       1.501
BillingRatio                                               -0.0062      0.001     -5.865      0.000      -0.008      -0.004
CostDeviation                                               0.1563      0.029      5.447      0.000       0.100       0.213
SeniorCitizen_Yes                                           0.2186      0.083      2.633      0.008       0.056       0.381
Dependents_Yes                                             -0.2101      0.066     -3.162      0.002      -0.340      -0.080
InternetService_Fiber optic                                 0.9831      0.064     15.359      0.000       0.858       1.109
OnlineSecurity_Yes                                         -0.5547      0.073     -7.595      0.000      -0.698      -0.412
OnlineBackup_Yes                                           -0.3604      0.066     -5.460      0.000      -0.490      -0.231
DeviceProtection_Yes                                       -0.1808      0.061     -2.959      0.003      -0.301      -0.061
TechSupport_Yes                                            -0.6175      0.074     -8.339      0.000      -0.763      -0.472
StreamingMovies_No internet service                        -1.2565      0.096    -13.045      0.000      -1.445      -1.068
PaperlessBilling_Yes                                        0.3207      0.058      5.491      0.000       0.206       0.435
PhoneServiceStatus_Single phone line                       -0.3100      0.059     -5.217      0.000      -0.426      -0.194
TenureGroup_13–24m                                         -1.3917      0.087    -15.911      0.000      -1.563      -1.220
TenureGroup_25–48m                                         -1.6459      0.088    -18.770      0.000      -1.818      -1.474
TenureGroup_49m+                                           -2.0756      0.108    -19.254      0.000      -2.287      -1.864
TenureGroup_7–12m                                          -1.0417      0.099    -10.537      0.000      -1.235      -0.848
ContractPaymentCombo_Month-to-month_Electronic check        0.5016      0.065      7.745      0.000       0.375       0.628
ContractPaymentCombo_One year_Bank transfer (automatic)    -0.6144      0.146     -4.197      0.000      -0.901      -0.327
ContractPaymentCombo_One year_Credit card (automatic)      -0.8159      0.135     -6.030      0.000      -1.081      -0.551
ContractPaymentCombo_One year_Electronic check             -0.3349      0.118     -2.845      0.004      -0.566      -0.104
ContractPaymentCombo_One year_Mailed check                 -0.5721      0.157     -3.646      0.000      -0.880      -0.265
ContractPaymentCombo_Two year_Bank transfer (automatic)    -1.3811      0.187     -7.386      0.000      -1.748      -1.015
ContractPaymentCombo_Two year_Credit card (automatic)      -1.7470      0.195     -8.953      0.000      -2.129      -1.365
ContractPaymentCombo_Two year_Electronic check             -1.0461      0.236     -4.427      0.000      -1.509      -0.583
ContractPaymentCombo_Two year_Mailed check                 -1.3455      0.243     -5.532      0.000      -1.822      -0.869
===========================================================================================================================
In [71]:
# Calculate odds ratios
odds_ratios = np.exp(m1_logit_tuned.params)
odds_table = pd.DataFrame({
    'Coefficient': m1_logit_tuned.params,
    'Odds Ratio': odds_ratios
})
print('Odds-ratio:- \n',odds_table)
Odds-ratio:- 
                                                     Coefficient  Odds Ratio
const                                                     1.334       3.795
BillingRatio                                             -0.006       0.994
CostDeviation                                             0.156       1.169
SeniorCitizen_Yes                                         0.219       1.244
Dependents_Yes                                           -0.210       0.811
InternetService_Fiber optic                               0.983       2.673
OnlineSecurity_Yes                                       -0.555       0.574
OnlineBackup_Yes                                         -0.360       0.697
DeviceProtection_Yes                                     -0.181       0.835
TechSupport_Yes                                          -0.618       0.539
StreamingMovies_No internet service                      -1.256       0.285
PaperlessBilling_Yes                                      0.321       1.378
PhoneServiceStatus_Single phone line                     -0.310       0.733
TenureGroup_13–24m                                       -1.392       0.249
TenureGroup_25–48m                                       -1.646       0.193
TenureGroup_49m+                                         -2.076       0.125
TenureGroup_7–12m                                        -1.042       0.353
ContractPaymentCombo_Month-to-month_Electronic ...        0.502       1.651
ContractPaymentCombo_One year_Bank transfer (au...       -0.614       0.541
ContractPaymentCombo_One year_Credit card (auto...       -0.816       0.442
ContractPaymentCombo_One year_Electronic check           -0.335       0.715
ContractPaymentCombo_One year_Mailed check               -0.572       0.564
ContractPaymentCombo_Two year_Bank transfer (au...       -1.381       0.251
ContractPaymentCombo_Two year_Credit card (auto...       -1.747       0.174
ContractPaymentCombo_Two year_Electronic check           -1.046       0.351
ContractPaymentCombo_Two year_Mailed check               -1.346       0.260
In [72]:
y_pred_m1 = m1_logit_tuned.predict(X_train_m1)
fpr, tpr, thresholds = roc_curve(y_train_m1, y_pred_m1)

# Plot ROC curve
roc_auc = roc_auc_score(y_train_m1, y_pred_m1)  ## get the ROC-AUC score
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC)')
plt.legend(loc="lower right")
plt.grid()
plt.show()
No description has been provided for this image
In [73]:
# Find the optimal threshold
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_logit = round(thresholds[optimal_idx], 3)
print("\nOptimal Threshold: ", optimal_threshold_logit)
Optimal Threshold:  0.546
In [74]:
# Checking Model Performance on Training dataset
m1_logit_perf_train = model_performance_classification(
    m1_logit_tuned, X_train_m1, y_train_m1, optimal_threshold_logit
)
m1_logit_perf_train
Out[74]:
Accuracy Recall Precision F1
0 0.791 0.785 0.794 0.789
In [75]:
plot_confusion_matrix(m1_logit_tuned, X_train_m1, y_train_m1)
No description has been provided for this image
In [76]:
# Checking Model Performance on Validation dataset
m1_logit_perf_val = model_performance_classification(
    m1_logit_tuned, X_val_m1, y_val_m1, optimal_threshold_logit
)
m1_logit_perf_val
Out[76]:
Accuracy Recall Precision F1
0 0.779 0.757 0.583 0.659
In [77]:
plot_confusion_matrix(m1_logit_tuned, X_val_m1, y_val_m1)
No description has been provided for this image

Model 2 - Ridge Logistic Regression (L2) — Recall-optimized CV¶

In [78]:
%%time
le = LabelEncoder()
y_train_m2 = le.fit_transform(y_train_final)
y_val_m2 = le.transform(y_val_final)
y_test_m2 = le.transform(y_test_final)

X_train_m2 = X_train_final.copy()
X_val_m2 = X_val_final.copy()
X_test_m2 = X_test_final.copy()

m2_logit = LogisticRegressionCV(
    Cs=np.logspace(-3, 3, 10),   # search C from 0.001 to 1000
    cv=5,
    penalty='l2',
    solver='lbfgs',              # good for L2
    scoring='recall',            # optimize recall
    max_iter=2000,
    n_jobs=-1,
    random_state=42
)
m2_logit.fit(X_train_m2, y_train_m2)

print("Best C (Ridge):", m2_logit.C_[0])
Best C (Ridge): 2.154434690031882
CPU times: total: 125 ms
Wall time: 5.88 s
In [79]:
# Checking Model Performance on Training dataset
m2_logit_perf_train = model_performance_classification(
    m2_logit, X_train_m2, y_train_m2
)
m2_logit_perf_train
Out[79]:
Accuracy Recall Precision F1
0 0.788 0.812 0.774 0.793
In [80]:
plot_confusion_matrix(m2_logit, X_train_m2, y_train_m2)
No description has been provided for this image
In [81]:
# Checking Model Performance on Validation dataset
m2_logit_perf_val = model_performance_classification(
    m2_logit, X_val_m2, y_val_m2
)
m2_logit_perf_val
Out[81]:
Accuracy Recall Precision F1
0 0.766 0.789 0.561 0.656
In [82]:
plot_confusion_matrix(m2_logit, X_val_m2, y_val_m2)
No description has been provided for this image

Model 3 - Lasso Logistic Regression (L1) — Recall-optimized CV¶

In [83]:
%%time
le = LabelEncoder()
y_train_m3 = le.fit_transform(y_train_final)
y_val_m3 = le.transform(y_val_final)
y_test_m3 = le.transform(y_test_final)

X_train_m3 = X_train_final.copy()
X_val_m3 = X_val_final.copy()
X_test_m3 = X_test_final.copy()

m3_logit = LogisticRegressionCV(
    Cs=np.logspace(-3, 3, 10),
    cv=5,
    penalty='l1',
    solver='liblinear',          # supports L1
    scoring='recall',            # optimize recall
    max_iter=2000,
    n_jobs=-1,
    random_state=42
)
m3_logit.fit(X_train_m3, y_train_m3)

print("Best C (Lasso):", m3_logit.C_[0])
Best C (Lasso): 0.46415888336127775
CPU times: total: 875 ms
Wall time: 7.4 s
In [84]:
# Checking Model Performance on Training dataset
m3_logit_perf_train = model_performance_classification(
    m3_logit, X_train_m3, y_train_m3
)
m3_logit_perf_train
Out[84]:
Accuracy Recall Precision F1
0 0.788 0.813 0.774 0.793
In [85]:
plot_confusion_matrix(m3_logit, X_train_m3, y_train_m3)
No description has been provided for this image
In [86]:
# Checking Model Performance on Validation dataset
m3_logit_perf_val = model_performance_classification(
    m3_logit, X_val_m3, y_val_m3
)
m3_logit_perf_val
Out[86]:
Accuracy Recall Precision F1
0 0.766 0.791 0.561 0.656
In [87]:
plot_confusion_matrix(m3_logit, X_val_m3, y_val_m3)
No description has been provided for this image

Model 4 - Elastic Net Logistic Regression (L1 + L2) — Recall-optimized CV¶

In [88]:
%%time
le = LabelEncoder()
y_train_m4 = le.fit_transform(y_train_final)
y_val_m4 = le.transform(y_val_final)
y_test_m4 = le.transform(y_test_final)

X_train_m4 = X_train_final.copy()
X_val_m4 = X_val_final.copy()
X_test_m4 = X_test_final.copy()

m4_logit = LogisticRegressionCV(
    Cs=np.logspace(-3, 3, 10),
    cv=5,
    penalty='elasticnet',
    solver='saga',               # required for elastic net
    l1_ratios=[0.2, 0.5, 0.8],   # try a few mixes of L1/L2
    scoring='recall',            # optimize recall
    max_iter=5000,               # saga may need more iterations
    n_jobs=-1,
    random_state=42
)
m4_logit.fit(X_train_m4, y_train_m4)

print("Best C (Elastic):", m4_logit.C_[0])
print("Best l1_ratio (Elastic):", m4_logit.l1_ratio_[0])
Best C (Elastic): 0.001
Best l1_ratio (Elastic): 0.5
CPU times: total: 42min 24s
Wall time: 6min 39s
In [89]:
# Checking Model Performance on Training dataset
m4_logit_perf_train = model_performance_classification(
    m4_logit, X_train_m4, y_train_m4
)
m4_logit_perf_train
Out[89]:
Accuracy Recall Precision F1
0 0.723 0.796 0.694 0.742
In [90]:
plot_confusion_matrix(m4_logit, X_train_m4, y_train_m4)
No description has been provided for this image
In [91]:
# Checking Model Performance on Validation dataset
m4_logit_perf_val = model_performance_classification(
    m4_logit, X_val_m4, y_val_m4
)
m4_logit_perf_val
Out[91]:
Accuracy Recall Precision F1
0 0.668 0.764 0.449 0.565
In [92]:
plot_confusion_matrix(m4_logit, X_val_m4, y_val_m4)
No description has been provided for this image

Model Building – Advanced Models¶

Model 5 - Decision Tree¶

In [93]:
# Make a copy of final datasets & label encode the target variable
le = LabelEncoder()
y_train_m5 = le.fit_transform(y_train_final)
y_val_m5 = le.transform(y_val_final)
y_test_m5 = le.transform(y_test_final)

X_train_m5 = X_train_final.copy()
X_val_m5 = X_val_final.copy()
X_test_m5 = X_test_final.copy()
In [94]:
# Build Decision Tree Model

# Initialize Decision Tree Model
m5_DT = DecisionTreeClassifier(
    criterion='gini',        # Splitting criterion ('entropy' can also be tried)
    max_depth=None,          # Let tree expand fully first; can tune later
    min_samples_split=2,     # Minimum samples needed to split an internal node
    min_samples_leaf=1,      # Minimum samples required at a leaf node
    random_state=42          # For reproducibility
)

# Fit the model
m5_DT.fit(X_train_m5, y_train_m5)
Out[94]:
DecisionTreeClassifier(random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(random_state=42)
In [95]:
# Checking Model Performance on Training dataset
m5_DT_perf_train = model_performance_classification_sklearn(m5_DT, X_train_m5, y_train_m5)
m5_DT_perf_train
Out[95]:
Accuracy Recall Precision F1
0 1.000 0.999 1.000 1.000
In [96]:
plot_confusion_matrix(m5_DT, X_train_m5, y_train_m5)
No description has been provided for this image
In [97]:
# Checking Model Performance on Validation dataset
m5_DT_perf_val = model_performance_classification_sklearn(m5_DT, X_val_m5, y_val_m5)
m5_DT_perf_val
Out[97]:
Accuracy Recall Precision F1
0 0.719 0.539 0.503 0.520
In [98]:
plot_confusion_matrix(m5_DT, X_val_m5, y_val_m5)
No description has been provided for this image

Model 6 - Bagging¶

In [99]:
# Make a copy of final datasets & label encode the target variable
le = LabelEncoder()
y_train_m6 = le.fit_transform(y_train_final)
y_val_m6 = le.transform(y_val_final)
y_test_m6 = le.transform(y_test_final)

X_train_m6 = X_train_final.copy()
X_val_m6 = X_val_final.copy()
X_test_m6 = X_test_final.copy()
In [100]:
# Building Bagging Classifier Model

# Initialize base estimator (Decision Tree)
base_tree = DecisionTreeClassifier(
    criterion='gini',
    random_state=42
)

# Initialize Bagging model
m6_Bag = BaggingClassifier(
    estimator=base_tree,
    n_estimators=100,       # number of base learners (trees)
    max_samples=1.0,        # fraction of samples per tree
    max_features=1.0,       # fraction of features per tree
    bootstrap=True,         # with replacement sampling
    random_state=42,
    n_jobs=-1               # parallel processing
)

# Fit the Bagging model
m6_Bag.fit(X_train_m6, y_train_m6)
Out[100]:
BaggingClassifier(estimator=DecisionTreeClassifier(random_state=42),
                  n_estimators=100, n_jobs=-1, random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
BaggingClassifier(estimator=DecisionTreeClassifier(random_state=42),
                  n_estimators=100, n_jobs=-1, random_state=42)
DecisionTreeClassifier(random_state=42)
DecisionTreeClassifier(random_state=42)
In [101]:
# Checking Model Performance on Training dataset
m6_Bag_perf_train = model_performance_classification_sklearn(m6_Bag, X_train_m6, y_train_m6)
m6_Bag_perf_train
Out[101]:
Accuracy Recall Precision F1
0 0.999 1.000 0.999 0.999
In [102]:
plot_confusion_matrix(m6_Bag, X_train_m6, y_train_m6)
No description has been provided for this image
In [103]:
# Checking Model Performance on Validation dataset
m6_Bag_perf_val = model_performance_classification_sklearn(m6_Bag, X_val_m6, y_val_m6)
m6_Bag_perf_val
Out[103]:
Accuracy Recall Precision F1
0 0.765 0.582 0.584 0.583
In [104]:
plot_confusion_matrix(m6_Bag, X_val_m6, y_val_m6)
No description has been provided for this image

Model 7 - Random Forest¶

In [105]:
# Make a copy of final datasets & label encode the target variable
le = LabelEncoder()
y_train_m7 = le.fit_transform(y_train_final)
y_val_m7 = le.transform(y_val_final)
y_test_m7 = le.transform(y_test_final)

X_train_m7 = X_train_final.copy()
X_val_m7 = X_val_final.copy()
X_test_m7 = X_test_final.copy()
In [106]:
# Building Random Forest Classifier Model

# Initialize Random Forest model
m7_RF = RandomForestClassifier(
    n_estimators=200,        # number of trees in the forest
    criterion='gini',        # splitting criterion
    max_depth=None,          # trees grow fully (can tune later)
    min_samples_split=2,     # minimum samples to split an internal node
    min_samples_leaf=1,      # minimum samples at leaf nodes
    max_features='sqrt',     # best practice for classification (sqrt of total features)
    bootstrap=True,          # bootstrap sampling for bagging
    random_state=42,         # reproducibility
    n_jobs=-1                # use all processors
)

# Fit the model
m7_RF.fit(X_train_m7, y_train_m7)
Out[106]:
RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
In [107]:
# Checking Model Performance on Training dataset
m7_RF_perf_train = model_performance_classification_sklearn(m7_RF, X_train_m7, y_train_m7)
m7_RF_perf_train
Out[107]:
Accuracy Recall Precision F1
0 1.000 1.000 0.999 1.000
In [108]:
plot_confusion_matrix(m7_RF, X_train_m7, y_train_m7)
No description has been provided for this image
In [109]:
# Checking Model Performance on Validation dataset
m7_RF_perf_val = model_performance_classification_sklearn(m7_RF, X_val_m7, y_val_m7)
m7_RF_perf_val
Out[109]:
Accuracy Recall Precision F1
0 0.785 0.589 0.627 0.607
In [110]:
plot_confusion_matrix(m7_RF, X_val_m7, y_val_m7)
No description has been provided for this image

Model 8 - AdaBoost¶

In [111]:
# Make a copy of final datasets & label encode the target variable
le = LabelEncoder()
y_train_m8 = le.fit_transform(y_train_final)
y_val_m8 = le.transform(y_val_final)
y_test_m8 = le.transform(y_test_final)

X_train_m8 = X_train_final.copy()
X_val_m8 = X_val_final.copy()
X_test_m8 = X_test_final.copy()
In [112]:
# Building AdaBoost Classifier Model

# Initialize the base estimator (weak learner)
base_estimator = DecisionTreeClassifier(
    max_depth=1,             # simple stump; prevents overfitting
    criterion='gini',
    random_state=42
)

# Initialize the AdaBoost model
m8_Ada = AdaBoostClassifier(
    estimator=base_estimator, # weak learner
    n_estimators=200,         # number of weak learners
    learning_rate=0.1,        # step size for weight updates
    random_state=42
)

# Fit the model
m8_Ada.fit(X_train_m8, y_train_m8)
Out[112]:
AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1,
                                                    random_state=42),
                   learning_rate=0.1, n_estimators=200, random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1,
                                                    random_state=42),
                   learning_rate=0.1, n_estimators=200, random_state=42)
DecisionTreeClassifier(max_depth=1, random_state=42)
DecisionTreeClassifier(max_depth=1, random_state=42)
In [113]:
# Checking Model Performance on Training dataset
m8_Ada_perf_train = model_performance_classification_sklearn(m8_Ada, X_train_m8, y_train_m8)
m8_Ada_perf_train
Out[113]:
Accuracy Recall Precision F1
0 0.822 0.842 0.811 0.826
In [114]:
plot_confusion_matrix(m8_Ada, X_train_m8, y_train_m8)
No description has been provided for this image
In [115]:
# Checking Model Performance on Validation dataset
m8_Ada_perf_val = model_performance_classification_sklearn(m8_Ada, X_val_m8, y_val_m8)
m8_Ada_perf_val
Out[115]:
Accuracy Recall Precision F1
0 0.776 0.739 0.581 0.651
In [116]:
plot_confusion_matrix(m8_Ada, X_val_m8, y_val_m8)
No description has been provided for this image

Model 9 - Gradient Boosting¶

In [117]:
# Make a copy of final datasets & label encode the target variable
le = LabelEncoder()
y_train_m9 = le.fit_transform(y_train_final)
y_val_m9 = le.transform(y_val_final)
y_test_m9 = le.transform(y_test_final)

X_train_m9 = X_train_final.copy()
X_val_m9 = X_val_final.copy()
X_test_m9 = X_test_final.copy()
In [118]:
# Building Gradient Boosting Classifier Model

# Initialize Gradient Boosting model
m9_GB = GradientBoostingClassifier(
    loss='log_loss',        # logistic loss for classification
    n_estimators=200,       # number of boosting stages (trees)
    learning_rate=0.05,     # how fast the model learns; lower = slower but safer
    max_depth=3,            # depth of individual regression trees (via max_depth in base learners)
    subsample=1.0,          # 1.0 = no stochastic subsampling; <1.0 can help generalization
    random_state=42
)

# Fit the model
m9_GB.fit(X_train_m9, y_train_m9)
Out[118]:
GradientBoostingClassifier(learning_rate=0.05, n_estimators=200,
                           random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GradientBoostingClassifier(learning_rate=0.05, n_estimators=200,
                           random_state=42)
In [119]:
# Checking Model Performance on Training dataset
m9_GB_perf_train = model_performance_classification_sklearn(m9_GB, X_train_m9, y_train_m9)
m9_GB_perf_train
Out[119]:
Accuracy Recall Precision F1
0 0.864 0.860 0.866 0.863
In [120]:
plot_confusion_matrix(m9_GB, X_train_m9, y_train_m9)
No description has been provided for this image
In [121]:
# Checking Model Performance on Validation dataset
m9_GB_perf_val = model_performance_classification_sklearn(m9_GB, X_val_m9, y_val_m9)
m9_GB_perf_val
Out[121]:
Accuracy Recall Precision F1
0 0.788 0.647 0.619 0.633
In [122]:
plot_confusion_matrix(m9_GB, X_val_m9, y_val_m9)
No description has been provided for this image

Model 10 - XGBoost¶

In [123]:
# Make a copy of final datasets & label encode the target variable
le = LabelEncoder()
y_train_m10 = le.fit_transform(y_train_final)
y_val_m10 = le.transform(y_val_final)
y_test_m10 = le.transform(y_test_final)

X_train_m10 = X_train_final.copy()
X_val_m10 = X_val_final.copy()
X_test_m10 = X_test_final.copy()
In [124]:
%%time
# Building XGBoost Classifier Model
m10_XGB = XGBClassifier(
    n_estimators=300,        # number of boosting trees
    learning_rate=0.05,      # shrinkage / step size
    max_depth=4,             # tree depth (controls complexity)
    subsample=0.8,           # sample % of rows per tree (reduces overfitting)
    colsample_bytree=0.8,    # sample % of features per tree (reduces overfitting)
    reg_lambda=1.0,          # L2 regularization
    reg_alpha=0.0,           # L1 regularization (can tune >0 to encourage sparsity)
    min_child_weight=1,      # min sum of instance weight (min data per leaf)
    objective='binary:logistic',  # binary classification with probability output
    eval_metric='logloss',   # internal eval metric
    random_state=42,
    n_jobs=-1
)

# Fit the model
m10_XGB.fit(X_train_m10, y_train_m10)
CPU times: total: 4.59 s
Wall time: 1.15 s
Out[124]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=0.8, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, gamma=None, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.05, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=4,
              max_leaves=None, min_child_weight=1, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=300,
              n_jobs=-1, num_parallel_tree=None, random_state=42, ...)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=0.8, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, gamma=None, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.05, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=4,
              max_leaves=None, min_child_weight=1, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=300,
              n_jobs=-1, num_parallel_tree=None, random_state=42, ...)
In [125]:
# Checking Model Performance on Training dataset
m10_XGB_perf_train = model_performance_classification_sklearn(m10_XGB, X_train_m10, y_train_m10)
m10_XGB_perf_train
Out[125]:
Accuracy Recall Precision F1
0 0.883 0.872 0.891 0.882
In [126]:
plot_confusion_matrix(m10_XGB, X_train_m10, y_train_m10)
No description has been provided for this image
In [127]:
# Checking Model Performance on Validation dataset
m10_XGB_perf_val = model_performance_classification_sklearn(m10_XGB, X_val_m10, y_val_m10)
m10_XGB_perf_val
Out[127]:
Accuracy Recall Precision F1
0 0.789 0.610 0.631 0.620
In [224]:
plot_confusion_matrix(m10_XGB, X_val_m10, y_val_m10)
No description has been provided for this image

Model Performance Improvement using Hyperparameter Tuning¶

Compare Model Performance (Baseline + Advanced)¶

In [208]:
model_comparison_df = []
# Dictionary to map model index
model_names = {
    1: "Logistic Regression (Baseline)",
    2: "Ridge Logistic Regression (L2)",
    3: "Lasso Logistic Regression (L1)",
    4: "Elastic Net Logistic Regression (L1+L2)",
    5: "Decision Tree",
    6: "Bagging Classifier",
    7: "Random Forest",
    8: "AdaBoost",
    9: "Gradient Boosting",
    10: "XGBoost"
}

# Dictionary to map model index
model_perf_dfs = {
    1: (m1_logit_perf_train,  m1_logit_perf_val),
    2: (m2_logit_perf_train,  m2_logit_perf_val),
    3: (m3_logit_perf_train,  m3_logit_perf_val),
    4: (m4_logit_perf_train,  m4_logit_perf_val),
    5: (m5_DT_perf_train,  m5_DT_perf_val),
    6: (m6_Bag_perf_train,  m6_Bag_perf_val),
    7: (m7_RF_perf_train,  m7_RF_perf_val),
    8: (m8_Ada_perf_train,  m8_Ada_perf_val),
    9: (m9_GB_perf_train,  m9_GB_perf_val),
    10: (m10_XGB_perf_train, m10_XGB_perf_val)
}

# Build rows for the final summary table
rows = []

for model_id, (train_df, val_df) in model_perf_dfs.items():
    train_recall = train_df["Recall"].iloc[0]
    val_recall = val_df["Recall"].iloc[0]
    rows.append({
        "Model No.": model_id,
        "Model Name": model_names[model_id],
        "Training Recall":  train_df["Recall"].iloc[0],
        "Validation Recall": val_df["Recall"].iloc[0],
        "Recall Gap (Train-Val)": round(train_recall - val_recall, 4),
        "Training F1":      train_df["F1"].iloc[0],
        "Validation F1":    val_df["F1"].iloc[0]
    })

# Create final comparison table
model_comparison_df = pd.DataFrame(rows)

# Sort by Validation Recall (descending) to see best churn-catchers first
model_comparison_df = model_comparison_df.sort_values(
    by="Validation Recall", ascending=False
).reset_index(drop=True)

model_comparison_df
Out[208]:
Model No. Model Name Training Recall Validation Recall Recall Gap (Train-Val) Training F1 Validation F1
0 3 Lasso Logistic Regression (L1) 0.813 0.791 0.022 0.793 0.656
1 2 Ridge Logistic Regression (L2) 0.812 0.789 0.023 0.793 0.656
2 4 Elastic Net Logistic Regression (L1+L2) 0.796 0.764 0.031 0.742 0.565
3 1 Logistic Regression (Baseline) 0.785 0.757 0.028 0.789 0.659
4 8 AdaBoost 0.842 0.739 0.102 0.826 0.651
5 9 Gradient Boosting 0.860 0.647 0.214 0.863 0.633
6 10 XGBoost 0.872 0.610 0.263 0.882 0.620
7 7 Random Forest 1.000 0.589 0.411 1.000 0.607
8 6 Bagging Classifier 1.000 0.582 0.418 0.999 0.583
9 5 Decision Tree 0.999 0.539 0.460 1.000 0.520

Besides Logistic Regression that has outperformed in our metrics, we shortlist the following 3 for further tuning:-

  • Model 8 – AdaBoost: Achieves a strong training recall (0.842) and a fairly balanced validation recall (0.739) — indicating good bias-variance tradeoff and potential for improvement with tuning.
  • Model 9 – Gradient Boosting: Shows decent validation recall (0.647) and strong F1 (0.633) — a stable boosting-based learner that can be optimized via learning rate, tree depth, and number of estimators to further enhance recall.
  • Model 10 – XGBoost: Demonstrates the highest recall (0.872) on training but some drop in validation (0.610) — suggests mild overfitting that can be corrected through hyperparameter tuning (regularization, learning rate, and depth control).

These three ensemble models already outperform basic tree/bagging models in generalization (Decision Tree, Random Forest, Bagging show clear overfitting with 1.0 recall/F1 on training). All three are boosting-based, hence capable of improving minority-class (churn) recall through careful tuning.

Model 11 - Logistic Regression - Tuned¶

In [129]:
# Make a copy of final datasets & label encode the target variable
le = LabelEncoder()
y_train_m11 = le.fit_transform(y_train_final)
y_val_m11 = le.transform(y_val_final)
y_test_m11 = le.transform(y_test_final)

X_train_m11 = X_train_final.copy()
X_val_m11 = X_val_final.copy()
X_test_m11 = X_test_final.copy()
In [130]:
%%time

#Build Tuned Logistic Regression Model - GridSearchCV
# Define base model
base_log = LogisticRegression(solver='liblinear', random_state=42)

# Define parameter grid for tuning
param_grid = {
    'penalty': ['l1', 'l2'],                 # L1 = Lasso, L2 = Ridge
    'C': [0.001, 0.01, 0.1, 1, 10, 100],     # Regularization strength (inverse of λ)
    'class_weight': [None, 'balanced'],      # Handle class imbalance if any
    'max_iter': [100, 500, 1000, 2000, 5000]       # Number of iterations for convergence
}

# Initialize GridSearchCV
grid_search = GridSearchCV(
    estimator=base_log,
    param_grid=param_grid,
    scoring='recall',        # Optimizing Recall (business goal: catch churners)
    cv=5,                    # 5-fold cross-validation
    n_jobs=-1,               # Use all cores for speed
    verbose=2
)

# Fit on training data
grid_search.fit(X_train_m11, y_train_m11)

# Print best parameters and best recall score
print("Best Parameters:", grid_search.best_params_)
print("Best Recall (CV):", grid_search.best_score_)

# Retrieve best model
m11_Logit_tuned_g = grid_search.best_estimator_
Fitting 5 folds for each of 120 candidates, totalling 600 fits
Best Parameters: {'C': 1, 'class_weight': None, 'max_iter': 100, 'penalty': 'l1'}
Best Recall (CV): 0.8113899613899613
CPU times: total: 4.3 s
Wall time: 37.6 s
In [131]:
# Checking Model Performance on Training dataset
m11_Logit_tuned_g_perf_train = model_performance_classification_sklearn(m11_Logit_tuned_g, X_train_m11, y_train_m11)
m11_Logit_tuned_g_perf_train
Out[131]:
Accuracy Recall Precision F1
0 0.788 0.812 0.775 0.793
In [132]:
plot_confusion_matrix(m11_Logit_tuned_g, X_train_m11, y_train_m11)
No description has been provided for this image
In [133]:
# Checking Model Performance on Validation dataset
m11_Logit_tuned_g_perf_val = model_performance_classification_sklearn(m11_Logit_tuned_g, X_val_m11, y_val_m11)
m11_Logit_tuned_g_perf_val
Out[133]:
Accuracy Recall Precision F1
0 0.766 0.789 0.560 0.655
In [134]:
plot_confusion_matrix(m11_Logit_tuned_g, X_val_m11, y_val_m11)
No description has been provided for this image

Model 12 - AdaBoost - Tuned¶

In [140]:
# Make a copy of final datasets & label encode the target variable
le = LabelEncoder()
y_train_m12 = le.fit_transform(y_train_final)
y_val_m12 = le.transform(y_val_final)
y_test_m12 = le.transform(y_test_final)

X_train_m12 = X_train_final.copy()
X_val_m12 = X_val_final.copy()
X_test_m12 = X_test_final.copy()
In [196]:
%%time

#Build Tuned AdaBoost Model - RandomSearchCV

# Base weak learner
stump = DecisionTreeClassifier(
    max_depth=1,
    criterion='gini',
    random_state=42
)

ada = AdaBoostClassifier(
    estimator=stump,
    random_state=42
)

# Define broader distributions for randomized search
param_dist_ada = {
    'n_estimators': np.arange(50, 500, 50),  
    'learning_rate': np.arange(0.01, 0.5, 0.05),  
    'estimator__max_depth': np.arange(1, 4),     
    'estimator__criterion': ['gini', 'entropy']
}

random_search_ada = RandomizedSearchCV(
    estimator=ada,
    param_distributions=param_dist_ada,
    n_iter=30,                
    scoring='recall',
    cv=5,
    n_jobs=-1,
    random_state=42,
    verbose=2
)

random_search_ada.fit(X_train_m12, y_train_m12)

print("Best Params (RandomSearchCV):", random_search_ada.best_params_)
print("Best CV Recall (RandomSearchCV):", random_search_ada.best_score_)

# Best tuned AdaBoost model from RandomSearchCV
m12_Ada_tuned_r = random_search_ada.best_estimator_
Fitting 5 folds for each of 30 candidates, totalling 150 fits
Best Params (RandomSearchCV): {'n_estimators': 100, 'learning_rate': 0.41000000000000003, 'estimator__max_depth': 1, 'estimator__criterion': 'gini'}
Best CV Recall (RandomSearchCV): 0.8388030888030888
CPU times: total: 6.78 s
Wall time: 5min 54s
In [197]:
# Checking Model Performance on Training dataset
m12_Ada_tuned_r_perf_train = model_performance_classification_sklearn(m12_Ada_tuned_r, X_train_m12, y_train_m12)
m12_Ada_tuned_r_perf_train
Out[197]:
Accuracy Recall Precision F1
0 0.839 0.856 0.828 0.842
In [198]:
plot_confusion_matrix(m12_Ada_tuned_r, X_train_m12, y_train_m12)
No description has been provided for this image
In [199]:
# Checking Model Performance on Validation dataset
m12_Ada_tuned_r_perf_val = model_performance_classification_sklearn(m12_Ada_tuned_r, X_val_m12, y_val_m12)
m12_Ada_tuned_r_perf_val
Out[199]:
Accuracy Recall Precision F1
0 0.780 0.720 0.591 0.649
In [200]:
plot_confusion_matrix(m12_Ada_tuned_r, X_val_m12, y_val_m12)
No description has been provided for this image

Model 13 - Gradient Boosting - Tuned¶

In [151]:
# Make a copy of final datasets & label encode the target variable
le = LabelEncoder()
y_train_m13 = le.fit_transform(y_train_final)
y_val_m13 = le.transform(y_val_final)
y_test_m13 = le.transform(y_test_final)

X_train_m13 = X_train_final.copy()
X_val_m13 = X_val_final.copy()
X_test_m13 = X_test_final.copy()
In [157]:
%%time
#Build Tuned Gradient Boosting - RandomSearchCV

# Base model
gb_base = GradientBoostingClassifier(
    random_state=42
)

param_dist_gb = {
    'n_estimators': np.arange(100, 501, 50),       # 100 to 500 trees in steps of 50
    'learning_rate': np.arange(0.01, 0.21, 0.03),  # 0.01 to 0.2 in steps of 0.03
    'max_depth': np.arange(2, 6),                  # 2, 3, 4, 5
    'subsample': np.arange(0.6, 1.1, 0.1)          # 0.6 to 1.0 in steps of 0.1
}

gb_random = RandomizedSearchCV(
    estimator=gb_base,
    param_distributions=param_dist_gb,
    n_iter=30,               
    scoring='recall',         
    cv=5,
    n_jobs=-1,
    random_state=42,
    verbose=2
)

# Fit on training data
gb_random.fit(X_train_m13, y_train_m13)

print("Best Params (Gradient Boosting, RandomizedSearchCV):", gb_random.best_params_)
print("Best CV Recall (Gradient Boosting, RandomizedSearchCV):", gb_random.best_score_)

# Best tuned Gradient Boosting model from RandomizedSearchCV
m13_GB_tuned_r = gb_random.best_estimator_
Fitting 5 folds for each of 30 candidates, totalling 150 fits
Best Params (Gradient Boosting, RandomizedSearchCV): {'subsample': 0.6, 'n_estimators': 100, 'max_depth': 4, 'learning_rate': 0.04}
Best CV Recall (Gradient Boosting, RandomizedSearchCV): 0.8362934362934362
CPU times: total: 12 s
Wall time: 7min 39s
In [201]:
# Checking Model Performance on Training dataset
m13_GB_tuned_r_perf_train = model_performance_classification_sklearn(m13_GB_tuned_r, X_train_m13, y_train_m13)
m13_GB_tuned_r_perf_train
Out[201]:
Accuracy Recall Precision F1
0 0.858 0.867 0.851 0.859
In [202]:
plot_confusion_matrix(m13_GB_tuned_r, X_train_m13, y_train_m13)
No description has been provided for this image
In [203]:
# Checking Model Performance on Validation dataset
m13_GB_tuned_r_perf_val = model_performance_classification_sklearn(m13_GB_tuned_r, X_val_m13, y_val_m13)
m13_GB_tuned_r_perf_val
Out[203]:
Accuracy Recall Precision F1
0 0.785 0.676 0.606 0.639
In [204]:
plot_confusion_matrix(m13_GB_tuned_r, X_val_m13, y_val_m13)
No description has been provided for this image

Model 14 - XGBoost - Tuned¶

In [162]:
# Make a copy of final datasets & label encode the target variable
le = LabelEncoder()
y_train_m14 = le.fit_transform(y_train_final)
y_val_m14 = le.transform(y_val_final)
y_test_m14 = le.transform(y_test_final)

X_train_m14 = X_train_final.copy()
X_val_m14 = X_val_final.copy()
X_test_m14 = X_test_final.copy()
In [195]:
%%time
#Build Tuned XGBoost - RandomSearchCV

xgb_base = XGBClassifier(
    objective='binary:logistic',
    eval_metric='logloss',
    random_state=42,
    n_jobs=-1
)

param_dist_xgb = {
    'n_estimators': np.arange(100, 501, 50),        
    'learning_rate': np.arange(0.01, 0.21, 0.03),   
    'max_depth': np.arange(3, 8),                  
    'subsample': np.arange(0.5, 1.01, 0.1),         
    'colsample_bytree': np.arange(0.5, 1.01, 0.1),  
    'min_child_weight': np.arange(1, 8, 2),         
    'reg_lambda': np.arange(0.0, 5.5, 0.5),         
    'reg_alpha': np.arange(0.0, 2.1, 0.3)           
}

xgb_random = RandomizedSearchCV(
    estimator=xgb_base,
    param_distributions=param_dist_xgb,
    n_iter=40,               
    scoring='recall',        
    cv=5,
    n_jobs=-1,
    random_state=42,
    verbose=2
)

xgb_random.fit(X_train_m14, y_train_m14)

print("Best Params (XGBoost, RandomizedSearchCV):", xgb_random.best_params_)
print("Best CV Recall (XGBoost, RandomizedSearchCV):", xgb_random.best_score_)

# Best model from RandomizedSearchCV
m14_XGB_tuned_r = xgb_random.best_estimator_
Fitting 5 folds for each of 40 candidates, totalling 200 fits
Best Params (XGBoost, RandomizedSearchCV): {'subsample': 0.6, 'reg_lambda': 2.5, 'reg_alpha': 1.2, 'n_estimators': 250, 'min_child_weight': 1, 'max_depth': 5, 'learning_rate': 0.01, 'colsample_bytree': 0.7}
Best CV Recall (XGBoost, RandomizedSearchCV): 0.8420849420849421
CPU times: total: 6.05 s
Wall time: 57.9 s
In [164]:
# Checking Model Performance on Training dataset
m14_XGB_tuned_r_perf_train = model_performance_classification_sklearn(m14_XGB_tuned_r, X_train_m14, y_train_m14)
m14_XGB_tuned_r_perf_train
Out[164]:
Accuracy Recall Precision F1
0 0.847 0.865 0.834 0.850
In [165]:
plot_confusion_matrix(m14_XGB_tuned_r, X_train_m14, y_train_m14)
No description has been provided for this image
In [166]:
# Checking Model Performance on Validation dataset
m14_XGB_tuned_r_perf_val = model_performance_classification_sklearn(m14_XGB_tuned_r, X_val_m14, y_val_m14)
m14_XGB_tuned_r_perf_val
Out[166]:
Accuracy Recall Precision F1
0 0.784 0.717 0.598 0.652
In [167]:
plot_confusion_matrix(m14_XGB_tuned_r, X_val_m14, y_val_m14)
No description has been provided for this image

Model 15 - Stacking Model¶

In [168]:
# Make a copy of final datasets & label encode the target variable
le = LabelEncoder()
y_train_m15 = le.fit_transform(y_train_final)
y_val_m15 = le.transform(y_val_final)
y_test_m15 = le.transform(y_test_final)

X_train_m15 = X_train_final.copy()
X_val_m15 = X_val_final.copy()
X_test_m15 = X_test_final.copy()
In [223]:
%%time

# Lets Build a Stacking Classifier with 3 strong performers
# Ridge Logistic Regression
# Model 9: Gradient Boosting
# Model 10: Adaboost
#The meta-model learns how to best combine the predictions of the base models - XGBoost

estimators = [
    ('ridge_logit', m11_Logit_tuned_g),            
    ('ada', m12_Ada_tuned_r),    
    ('gradboost', m13_GB_tuned_r)
]

final_xgb_meta = XGBClassifier(
    n_estimators=200,
    max_depth=3,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_lambda=1.0,
    random_state=42,
    scale_pos_weight=3
)

m15_stack = StackingClassifier(
    estimators=estimators,
    final_estimator=final_xgb_meta,
    cv=5,                
    n_jobs=-1,
    stack_method='predict_proba',  # use predicted probabilities
)

# Fit the stacked model
m15_stack.fit(X_train_m15, y_train_m15)
CPU times: total: 1.58 s
Wall time: 27.1 s
Out[223]:
StackingClassifier(cv=5,
                   estimators=[('ridge_logit',
                                LogisticRegression(C=1, penalty='l1',
                                                   random_state=42,
                                                   solver='liblinear')),
                               ('ada',
                                AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1,
                                                                                    random_state=42),
                                                   learning_rate=0.41000000000000003,
                                                   n_estimators=100,
                                                   random_state=42)),
                               ('gradboost',
                                GradientBoostingClassifier(learning_rate=0.04,
                                                           max_depth=...
                                                 importance_type=None,
                                                 interaction_constraints=None,
                                                 learning_rate=0.05,
                                                 max_bin=None,
                                                 max_cat_threshold=None,
                                                 max_cat_to_onehot=None,
                                                 max_delta_step=None,
                                                 max_depth=3, max_leaves=None,
                                                 min_child_weight=None,
                                                 missing=nan,
                                                 monotone_constraints=None,
                                                 multi_strategy=None,
                                                 n_estimators=200, n_jobs=None,
                                                 num_parallel_tree=None,
                                                 random_state=42, ...),
                   n_jobs=-1, stack_method='predict_proba')
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
StackingClassifier(cv=5,
                   estimators=[('ridge_logit',
                                LogisticRegression(C=1, penalty='l1',
                                                   random_state=42,
                                                   solver='liblinear')),
                               ('ada',
                                AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1,
                                                                                    random_state=42),
                                                   learning_rate=0.41000000000000003,
                                                   n_estimators=100,
                                                   random_state=42)),
                               ('gradboost',
                                GradientBoostingClassifier(learning_rate=0.04,
                                                           max_depth=...
                                                 importance_type=None,
                                                 interaction_constraints=None,
                                                 learning_rate=0.05,
                                                 max_bin=None,
                                                 max_cat_threshold=None,
                                                 max_cat_to_onehot=None,
                                                 max_delta_step=None,
                                                 max_depth=3, max_leaves=None,
                                                 min_child_weight=None,
                                                 missing=nan,
                                                 monotone_constraints=None,
                                                 multi_strategy=None,
                                                 n_estimators=200, n_jobs=None,
                                                 num_parallel_tree=None,
                                                 random_state=42, ...),
                   n_jobs=-1, stack_method='predict_proba')
LogisticRegression(C=1, penalty='l1', random_state=42, solver='liblinear')
DecisionTreeClassifier(max_depth=1, random_state=42)
DecisionTreeClassifier(max_depth=1, random_state=42)
GradientBoostingClassifier(learning_rate=0.04, max_depth=4, random_state=42,
                           subsample=0.6)
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=0.8, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.05, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=3, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=200, n_jobs=None,
              num_parallel_tree=None, random_state=42, ...)
In [170]:
# Checking Model Performance on Training dataset
m15_stack_perf_train = model_performance_classification_sklearn(m15_stack, X_train_m15, y_train_m15)
m15_stack_perf_train
Out[170]:
Accuracy Recall Precision F1
0 0.817 0.939 0.755 0.837
In [171]:
plot_confusion_matrix(m15_stack, X_train_m15, y_train_m15)
No description has been provided for this image
In [172]:
# Checking Model Performance on Validation dataset
m15_stack_perf_val = model_performance_classification_sklearn(m15_stack, X_val_m15, y_val_m15)
m15_stack_perf_val
Out[172]:
Accuracy Recall Precision F1
0 0.731 0.850 0.514 0.640
In [173]:
plot_confusion_matrix(m15_stack, X_val_m15, y_val_m15)
No description has been provided for this image

Model 16 - SVM¶

In [174]:
# Make a copy of final datasets & label encode the target variable
le = LabelEncoder()
y_train_m16 = le.fit_transform(y_train_final)
y_val_m16 = le.transform(y_val_final)
y_test_m16 = le.transform(y_test_final)

X_train_m16 = X_train_final.copy()
X_val_m16 = X_val_final.copy()
X_test_m16 = X_test_final.copy()
In [190]:
%%time
# Build SVM Model
# Initialize SVM classifier

# Base SVM model
svm_base = SVC(
    probability=False,                   # skip probability estimates → faster
    class_weight='balanced',             # handle imbalance natively
    cache_size=700,                      # allocate more memory to speed up kernel computation
    tol=1e-3,                            # slightly relaxed tolerance to converge faster
    max_iter=2000,                       # reasonable upper bound on iterations
    random_state=42
)

# Define parameter distribution for random search
param_dist_svm = {
    'C': np.logspace(-2, 1, 20),         # smaller, focused range (0.01 → 100)
    'gamma': np.logspace(-3, 0, 10),     # reasonable kernel width values
    'kernel': ['rbf'],                   # RBF only (most effective for churn)
}

# Randomized search setup
svm_random = RandomizedSearchCV(
    estimator=svm_base,
    param_distributions=param_dist_svm,
    n_iter=15,                           # limit to 15 smart random combinations
    scoring='recall',                    # optimize recall (catch churners)
    cv=5,                                # 5-fold CV
    n_jobs=-1,                           # use all cores
    verbose=1,
    random_state=42
)

svm_random.fit(X_train_m16, y_train_m16)

print("Best Params (SVM - RandomSearchCV):", svm_random.best_params_)
print("Best CV Recall (SVM - RandomSearchCV):", svm_random.best_score_)

m16_svm = svm_random.best_estimator_
Fitting 5 folds for each of 15 candidates, totalling 75 fits
Best Params (SVM - RandomSearchCV): {'kernel': 'rbf', 'gamma': 0.001, 'C': 4.832930238571752}
Best CV Recall (SVM - RandomSearchCV): 0.946911196911197
CPU times: total: 1.58 s
Wall time: 23.9 s
In [191]:
# Checking Model Performance on Training dataset
m16_svm_perf_train = model_performance_classification_sklearn(m16_svm, X_train_m16, y_train_m16)
m16_svm_perf_train
Out[191]:
Accuracy Recall Precision F1
0 0.480 0.901 0.489 0.634
In [192]:
plot_confusion_matrix(m16_svm, X_train_m16, y_train_m16)
No description has been provided for this image
In [193]:
# Checking Model Performance on Validation dataset
m16_svm_perf_val = model_performance_classification_sklearn(m16_svm, X_val_m16, y_val_m16)
m16_svm_perf_val
Out[193]:
Accuracy Recall Precision F1
0 0.306 0.915 0.278 0.427
In [194]:
plot_confusion_matrix(m16_svm, X_val_m16, y_val_m16)
No description has been provided for this image

Model 17 - ANN¶

In [176]:
# Make a copy of final datasets & label encode the target variable
le = LabelEncoder()
y_train_m17 = le.fit_transform(y_train_final)
y_val_m17 = le.transform(y_val_final)
y_test_m17 = le.transform(y_test_final)

X_train_m17 = X_train_final.copy()
X_val_m17 = X_val_final.copy()
X_test_m17 = X_test_final.copy()
In [177]:
# Build ANN Model
tf.keras.backend.clear_session()
m17_ann = Sequential([
    Dense(128, activation='relu', input_dim=X_train_m17.shape[1]),
    BatchNormalization(),
    Dropout(0.4),

    Dense(64, activation='relu'),
    BatchNormalization(),
    Dropout(0.3),

    Dense(32, activation='relu'),
    BatchNormalization(),
    Dropout(0.2),

    Dense(1, activation='sigmoid')  # output: churn probability
])

m17_ann.summary()
WARNING:tensorflow:From C:\Users\Maheep-Zephyrus\anaconda3\Lib\site-packages\keras\src\backend\common\global_state.py:82: The name tf.reset_default_graph is deprecated. Please use tf.compat.v1.reset_default_graph instead.

Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                         ┃ Output Shape                ┃         Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ dense (Dense)                        │ (None, 128)                 │           6,272 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ batch_normalization                  │ (None, 128)                 │             512 │
│ (BatchNormalization)                 │                             │                 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout (Dropout)                    │ (None, 128)                 │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_1 (Dense)                      │ (None, 64)                  │           8,256 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ batch_normalization_1                │ (None, 64)                  │             256 │
│ (BatchNormalization)                 │                             │                 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_1 (Dropout)                  │ (None, 64)                  │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_2 (Dense)                      │ (None, 32)                  │           2,080 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ batch_normalization_2                │ (None, 32)                  │             128 │
│ (BatchNormalization)                 │                             │                 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_2 (Dropout)                  │ (None, 32)                  │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_3 (Dense)                      │ (None, 1)                   │              33 │
└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘
 Total params: 17,537 (68.50 KB)
 Trainable params: 17,089 (66.75 KB)
 Non-trainable params: 448 (1.75 KB)
In [178]:
optimizer = Adam(learning_rate=0.0008)

m17_ann.compile(
    optimizer=optimizer,
    loss='binary_crossentropy',
    metrics=[
        'accuracy',
        tf.keras.metrics.Recall(name='recall'),
        tf.keras.metrics.Precision(name='precision')
    ]
)


# We stop when validation recall stops improving, because Recall is the primary KPI for churn.
early_stop = EarlyStopping(
    monitor='val_recall',
    mode='max',          # we want to maximize recall
    patience=10,
    restore_best_weights=True,
    verbose=1
)

history = m17_ann.fit(
    X_train_m17, y_train_m17,
    validation_data=(X_val_m17, y_val_m17),
    epochs=200,
    batch_size=64,
    callbacks=[early_stop],
    verbose=1
)
Epoch 1/200
162/162 ━━━━━━━━━━━━━━━━━━━━ 6s 10ms/step - accuracy: 0.6118 - loss: 0.7505 - precision: 0.6117 - recall: 0.6015 - val_accuracy: 0.7041 - val_loss: 0.5836 - val_precision: 0.4857 - val_recall: 0.8262
Epoch 2/200
162/162 ━━━━━━━━━━━━━━━━━━━━ 2s 6ms/step - accuracy: 0.7315 - loss: 0.5813 - precision: 0.7188 - recall: 0.7583 - val_accuracy: 0.7190 - val_loss: 0.5442 - val_precision: 0.5014 - val_recall: 0.8174
Epoch 3/200
162/162 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - accuracy: 0.7474 - loss: 0.5277 - precision: 0.7278 - recall: 0.7832 - val_accuracy: 0.7261 - val_loss: 0.5336 - val_precision: 0.5091 - val_recall: 0.8203
Epoch 4/200
162/162 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - accuracy: 0.7618 - loss: 0.5068 - precision: 0.7526 - recall: 0.7836 - val_accuracy: 0.7203 - val_loss: 0.5389 - val_precision: 0.5027 - val_recall: 0.8306
Epoch 5/200
162/162 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - accuracy: 0.7743 - loss: 0.4983 - precision: 0.7634 - recall: 0.7996 - val_accuracy: 0.7211 - val_loss: 0.5413 - val_precision: 0.5035 - val_recall: 0.8365
Epoch 6/200
162/162 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - accuracy: 0.7698 - loss: 0.4881 - precision: 0.7565 - recall: 0.8005 - val_accuracy: 0.7382 - val_loss: 0.5217 - val_precision: 0.5230 - val_recall: 0.8203
Epoch 7/200
162/162 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - accuracy: 0.7770 - loss: 0.4826 - precision: 0.7596 - recall: 0.8105 - val_accuracy: 0.7448 - val_loss: 0.5053 - val_precision: 0.5322 - val_recall: 0.7909
Epoch 8/200
162/162 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - accuracy: 0.7833 - loss: 0.4713 - precision: 0.7735 - recall: 0.8043 - val_accuracy: 0.7352 - val_loss: 0.5104 - val_precision: 0.5195 - val_recall: 0.8233
Epoch 9/200
162/162 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - accuracy: 0.7790 - loss: 0.4746 - precision: 0.7596 - recall: 0.8147 - val_accuracy: 0.7365 - val_loss: 0.5129 - val_precision: 0.5209 - val_recall: 0.8247
Epoch 10/200
162/162 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - accuracy: 0.7811 - loss: 0.4719 - precision: 0.7619 - recall: 0.8239 - val_accuracy: 0.7444 - val_loss: 0.5040 - val_precision: 0.5308 - val_recall: 0.8115
Epoch 11/200
162/162 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - accuracy: 0.7850 - loss: 0.4602 - precision: 0.7670 - recall: 0.8166 - val_accuracy: 0.7352 - val_loss: 0.5151 - val_precision: 0.5197 - val_recall: 0.8159
Epoch 12/200
162/162 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - accuracy: 0.7911 - loss: 0.4584 - precision: 0.7748 - recall: 0.8250 - val_accuracy: 0.7365 - val_loss: 0.5128 - val_precision: 0.5213 - val_recall: 0.8115
Epoch 13/200
162/162 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - accuracy: 0.7833 - loss: 0.4616 - precision: 0.7610 - recall: 0.8192 - val_accuracy: 0.7315 - val_loss: 0.5112 - val_precision: 0.5154 - val_recall: 0.8130
Epoch 14/200
162/162 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - accuracy: 0.7899 - loss: 0.4550 - precision: 0.7718 - recall: 0.8211 - val_accuracy: 0.7269 - val_loss: 0.5143 - val_precision: 0.5100 - val_recall: 0.8247
Epoch 15/200
162/162 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - accuracy: 0.7943 - loss: 0.4509 - precision: 0.7719 - recall: 0.8302 - val_accuracy: 0.7253 - val_loss: 0.5205 - val_precision: 0.5081 - val_recall: 0.8351
Epoch 15: early stopping
Restoring model weights from the end of the best epoch: 5.
In [179]:
def plot(history, name):
    """
    Function to plot loss/accuracy

    history: an object which stores the metrics and losses.
    name: can be one of Loss or Accuracy
    """
    fig, ax = plt.subplots() #Creating a subplot with figure and axes.
    plt.plot(history.history[name]) #Plotting the train accuracy or train loss
    plt.plot(history.history['val_'+name]) #Plotting the validation accuracy or validation loss

    plt.title('Model ' + name.capitalize()) #Defining the title of the plot.
    plt.ylabel(name.capitalize()) #Capitalizing the first letter.
    plt.xlabel('Epoch') #Defining the label for the x-axis.
    fig.legend(['Train', 'Validation'], loc="outside right upper") #Defining the legend, loc controls the position of the legend.
In [180]:
plot(history,'loss')
No description has been provided for this image
In [181]:
plot(history,'recall')
No description has been provided for this image
In [211]:
# Show Training Performance Metrics

# Get predicted probabilities from ANN
y_train_pred_prob_m17 = m17_ann.predict(X_train_m17)
y_val_pred_prob_m17   = m17_ann.predict(X_val_m17)
y_test_pred_prob_m17   = m17_ann.predict(X_test_m17)

# Convert probabilities to binary predictions using best threshold
y_train_pred_m17 = (y_train_pred_prob_m17 >= 0.5).astype(int)
y_val_pred_m17   = (y_val_pred_prob_m17 >= 0.5).astype(int)
y_test_pred_m17   = (y_test_pred_prob_m17 >= 0.5).astype(int)

m17_ann_perf_train = pd.DataFrame({
    'Accuracy':  [accuracy_score(y_train_m17, y_train_pred_m17)],
    'Recall':    [recall_score(y_train_m17, y_train_pred_m17)],
    'Precision': [precision_score(y_train_m17, y_train_pred_m17)],
    'F1':        [f1_score(y_train_m17, y_train_pred_m17)]
})
m17_ann_perf_train
324/324 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
76/76 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
76/76 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
Out[211]:
Accuracy Recall Precision F1
0 0.779 0.864 0.738 0.796
In [212]:
# Show Validation Performance Metrics
m17_ann_perf_val = pd.DataFrame({
    'Accuracy':  [accuracy_score(y_val_m17, y_val_pred_m17)],
    'Recall':    [recall_score(y_val_m17, y_val_pred_m17)],
    'Precision': [precision_score(y_val_m17, y_val_pred_m17)],
    'F1':        [f1_score(y_val_m17, y_val_pred_m17)]
})
m17_ann_perf_val
Out[212]:
Accuracy Recall Precision F1
0 0.721 0.837 0.504 0.629
In [184]:
# Confusion Matrix for Training Dataset

cm_ann_t = confusion_matrix(y_train_m17, y_train_pred_m17)
labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm_ann_t.flatten().sum())]
            for item in cm_ann_t.flatten()
        ]
).reshape(2, 2)

plt.figure(figsize=(6, 4))
sns.heatmap(cm_ann_t, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
Out[184]:
Text(0.5, 14.722222222222216, 'Predicted label')
No description has been provided for this image
In [185]:
# Confusion Matrix for Validation Dataset

cm_ann_v = confusion_matrix(y_val_m17, y_val_pred_m17)
labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm_ann_v.flatten().sum())]
            for item in cm_ann_v.flatten()
        ]
).reshape(2, 2)

plt.figure(figsize=(6, 4))
sns.heatmap(cm_ann_v, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
Out[185]:
Text(0.5, 14.722222222222216, 'Predicted label')
No description has been provided for this image

Model Performance Comparison and Final Model Selection¶

In [207]:
model_comparison_df_tuned = []
# Dictionary to map model index
model_names_tuned = {
    1: "Logistic Regression - Tuned",
    2: "AdaBoost - Tuned",
    3: "Gradient Boosting - Tuned",
    4: "XGBoost - Tuned",
    5: "Stacking Model",
    6: "SVM",
    7: "ANN"
}

# Dictionary to map model index
model_perf_dfs_tuned = {
    1: (m11_Logit_tuned_g_perf_train,  m11_Logit_tuned_g_perf_val),
    2: (m12_Ada_tuned_r_perf_train,  m12_Ada_tuned_r_perf_val),
    3: (m13_GB_tuned_r_perf_train,  m13_GB_tuned_r_perf_val),
    4: (m14_XGB_tuned_r_perf_train,  m14_XGB_tuned_r_perf_val),
    5: (m15_stack_perf_train,  m15_stack_perf_val),
    6: (m16_svm_perf_train,  m16_svm_perf_val),
    7: (m17_ann_perf_train,  m17_ann_perf_val)
}

# Build rows for the final summary table
rows = []

for model_id, (train_df, val_df) in model_perf_dfs_tuned.items():
    train_recall = train_df["Recall"].iloc[0]
    val_recall = val_df["Recall"].iloc[0]
    rows.append({
        "Model No.": model_id,
        "Model Name": model_names_tuned[model_id],
        "Training Recall":  train_df["Recall"].iloc[0],
        "Validation Recall": val_df["Recall"].iloc[0],
        "Recall Gap (Train-Val)": round(train_recall - val_recall, 4),
        "Training F1":      train_df["F1"].iloc[0],
        "Validation F1":    val_df["F1"].iloc[0]
    })

# Create final comparison table
model_comparison_df_tuned = pd.DataFrame(rows)

# Sort by Validation Recall (descending) to see best churn-catchers first
model_comparison_df_tuned = model_comparison_df_tuned.sort_values(
    by="Validation Recall", ascending=False
).reset_index(drop=True)

model_comparison_df_tuned
Out[207]:
Model No. Model Name Training Recall Validation Recall Recall Gap (Train-Val) Training F1 Validation F1
0 6 SVM 0.901 0.915 -0.014 0.634 0.427
1 5 Stacking Model 0.939 0.850 0.089 0.837 0.640
2 7 ANN 0.864 0.837 0.028 0.796 0.629
3 1 Logistic Regression - Tuned 0.812 0.789 0.023 0.793 0.655
4 2 AdaBoost - Tuned 0.856 0.720 0.136 0.842 0.649
5 4 XGBoost - Tuned 0.865 0.717 0.148 0.850 0.652
6 3 Gradient Boosting - Tuned 0.867 0.676 0.191 0.859 0.639

Test Performances of Top 3 Models - SVM, Stacking Model & ANN¶

Test Performance - SVM¶
In [209]:
# Checking SVM Model Performance on Training dataset
m16_svm_perf_test = model_performance_classification_sklearn(m16_svm, X_test_m16, y_test_m16)
m16_svm_perf_test
Out[209]:
Accuracy Recall Precision F1
0 0.300 0.907 0.275 0.422
In [210]:
plot_confusion_matrix(m16_svm, X_test_m16, y_test_m16)
No description has been provided for this image
Test Performance - Stacking Model¶
In [216]:
# Checking Stack Model Performance on Training dataset
m15_stack_perf_test = model_performance_classification_sklearn(m15_stack, X_test_m15, y_test_m15)
m15_stack_perf_test
Out[216]:
Accuracy Recall Precision F1
0 0.741 0.848 0.526 0.649
In [217]:
plot_confusion_matrix(m15_stack, X_test_m15, y_test_m15)
No description has been provided for this image
Test Performance - ANN¶
In [213]:
m17_ann_perf_test = pd.DataFrame({
    'Accuracy':  [accuracy_score(y_test_m17, y_test_pred_m17)],
    'Recall':    [recall_score(y_test_m17, y_test_pred_m17)],
    'Precision': [precision_score(y_test_m17, y_test_pred_m17)],
    'F1':        [f1_score(y_test_m17, y_test_pred_m17)]
})
m17_ann_perf_test
Out[213]:
Accuracy Recall Precision F1
0 0.738 0.837 0.522 0.643
In [214]:
# Confusion Matrix for Test Dataset

cm_ann_t = confusion_matrix(y_test_m17, y_test_pred_m17)
labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm_ann_t.flatten().sum())]
            for item in cm_ann_t.flatten()
        ]
).reshape(2, 2)

plt.figure(figsize=(6, 4))
sns.heatmap(cm_ann_t, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
Out[214]:
Text(0.5, 14.722222222222216, 'Predicted label')
No description has been provided for this image
Final Model Selection Based on Test Performance¶
In [218]:
model_comparison_df_final = []
# Dictionary to map model index
model_names_final = {
    1: "SVM",
    2: "Stacking Model",
    3: "ANN"
}

# Dictionary to map model index
model_perf_dfs_final = {
    1: (m16_svm_perf_train,  m16_svm_perf_test),
    2: (m15_stack_perf_train,  m15_stack_perf_test),
    3: (m17_ann_perf_train,  m17_ann_perf_test)
}

# Build rows for the final summary table
rows = []

for model_id, (train_df, test_df) in model_perf_dfs_final.items():
    train_recall = train_df["Recall"].iloc[0]
    test_recall = test_df["Recall"].iloc[0]
    rows.append({
        "Model No.": model_id,
        "Model Name": model_names_final[model_id],
        "Training Recall":  train_df["Recall"].iloc[0],
        "Test Recall": test_df["Recall"].iloc[0],
        "Recall Gap (Train-Test)": round(train_recall - test_recall, 4),
        "Training F1":      train_df["F1"].iloc[0],
        "Test F1":    test_df["F1"].iloc[0]
    })

# Create final comparison table
model_comparison_df_final = pd.DataFrame(rows)

# Sort by Validation Recall (descending) to see best churn-catchers first
model_comparison_df_final = model_comparison_df_final.sort_values(
    by="Test Recall", ascending=False
).reset_index(drop=True)

model_comparison_df_final
Out[218]:
Model No. Model Name Training Recall Test Recall Recall Gap (Train-Test) Training F1 Test F1
0 1 SVM 0.901 0.907 -0.006 0.634 0.422
1 2 Stacking Model 0.939 0.848 0.090 0.837 0.649
2 3 ANN 0.864 0.837 0.028 0.796 0.643

Actionable Insights & Recommendations¶

Deployment Recommendation:-

  • Deploy two complementary models — Stacking Model and Artificial Neural Network (ANN) — to balance recall coverage, interpretability, and generalization.

  • This approach maximizes churn capture while maintaining operational efficiency and scalability across customer segments.

  • Please refer Business Report for more details